Priority

1. Receipt is acknowledged of papers submitted under 35 U.S.C. 119(a)-(d), which
papers have been placed of record in the file.

2. Should applicant desire to obtain the benefit of foreign priority under 35
U.S.C. 119(a)-(d) prior to declaration of an interference, a translation of the foreign
application should be submitted under 37 CFR 1.55 in reply to this action.



Specification 

3. Applicant is reminded of the proper language and format for an abstract of the 
disclosure. 

The abstract should be in narrative form and generally limited to a single 
paragraph on a separate sheet within the range of 50 to 150 words. It is important that
the abstract not exceed 150 words in length since the space provided for the abstract 
on the computer tape used by the printer is limited. The form and legal phraseology 
often used in patent claims, such as "means" and "said," should be avoided. The 
abstract should describe the disclosure sufficiently to assist readers in deciding whether 
there is a need for consulting the full patent text for details. 

The language should be clear and concise and should not repeat information 
given in the title. It should avoid using phrases which can be implied, such as, "The
disclosure concerns," "The disclosure defined by this invention," "The disclosure 
describes," etc. 

4. The abstract of the disclosure is objected to because it contains legal terms often
used in patent claims: "comprises" and "means". Correction is required. See MPEP
§ 608.01(b).
Drawings

6. The drawings are objected to because of the following minor informality: 

In Fig. 10D, the expression (2.C, +C) should have read (2.C, +c) because the 
added entry c is a leaf node. 

A proposed drawing correction or corrected drawings are required in reply to the 
Office action to avoid abandonment of the application. The objection to the drawings 
will not be held in abeyance. 

Claim Rejections - 35 USC § 102 

7. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that
form the basis for the rejections under this section made in this Office action: 

A person shall be entitled to a patent unless - 

(e) the invention was described in (1) an application for patent, published under section 122(b), by 
another filed in the United States before the invention by the applicant for patent or (2) a patent 
granted on an application for patent by another filed in the United States before the invention by the 
applicant for patent, except that an international application filed under the treaty defined in section 
351(a) shall have the effects for purposes of this subsection of an application filed in the United States
only if the international application designated the United States and was published under Article 21(2) 
of such treaty in the English language, 

8. Claims 1-7 are rejected under 35 U.S.C. 102(e) as being anticipated by Prasad 
et al. (US 5,956,718), hereinafter referred to as "Prasad". 

As per claims 1, 3, Prasad teaches a transmitting method and apparatus for 
transmitting a hierarchical structure of a directory for hierarchically managing locations 
of contents data, comprising: 
• "managing means for managing a hierarchical structure of a directory composed
of a container entry and a leaf entry, a container entry containing information in
the immediately lower hierarchical level thereof, a leaf entry being disposed in
the immediately lower hierarchical level of a container entry, a leaf entry not
containing information in the immediately lower hierarchical level thereof" at Col.
2 lines 15-40 and Col. 4 lines 53-67;

• "detecting means for detecting a change of the hierarchical structure of the 
directory managed by said managing means" at Col. 6 lines 20-30; 

• "and obtaining position information and identification information corresponding 
to the detected result, the position information representing the position of a 
container entry in the hierarchical structure of the directory, the identification 
information identifying a leaf entry corresponding to the hierarchical structure of 
the directory" at Col. 6 lines 1-20; 

• "and transmitting means for transmitting the position information and the 
identification information" at Col. 6 lines 60-65, Col. 15 lines 5-25. 

As per claim 2, Prasad teaches the transmitting apparatus as set forth in claim 
1, wherein "said detecting means obtains difference information representing a change 
of the leaf entries corresponding to the detected result of a change of the hierarchical 
structure of the directory, and wherein said transmitting means transmits the difference 
information along with the identification information" at Col. 5 line 25 to Col. 6 line 60. 
As per claims 4, 5, Prasad teaches a receiving apparatus for receiving a
hierarchical structure of a directory for hierarchically managing the locations of contents 
data that are transmitted, comprising: 

• "receiving means for receiving position information and identification information, 
the position information being obtained by detecting a change of container 
entries, the position information representing the position of a container entry in 
the hierarchical structure of the directory, the identification information identifying 
a leaf entry corresponding to the hierarchical structure of the directory" at Col. 5 
line 25 to Col. 6 line 60; 

• "the directory being composed of container entries and leaf entries, a container
entry containing information in the immediately lower hierarchical level thereof, a
leaf entry not containing information in the immediately lower hierarchical level
thereof" at Col. 4 lines 53-67;

• "obtaining means for selectively obtaining the identification information of a leaf 
entry in the immediately lower hierarchical level of a container entry represented 
by the position information corresponding to selection information designated 
corresponding to the position information" at Col. 6 lines 5-20; and 

• "managing means for managing the hierarchical structure of the directory formed 
with the position information corresponding to the selection information and with 
the identification information that is selectively obtained" at Col. 2 lines 15-40 and 
Col. 6 lines 50-60. 
As per claims 6, 7, Prasad teaches a transmitting and receiving system for
transmitting a hierarchical structure of a directory for hierarchically managing locations 
of contents data and receiving the transmitted hierarchical structure, comprising: 

• "first managing means for managing a hierarchical structure of a directory
composed of a container entry and a leaf entry, a container entry containing
information in the immediately lower hierarchical level thereof, a leaf entry being
disposed in the immediately lower hierarchical level of a container entry, a leaf
entry not containing information in the immediately lower hierarchical level
thereof" at Col. 2 lines 15-40 and Col. 4 lines 50-67;

• "detecting means for detecting a change of the hierarchical structure of the 
directory managed by said first managing means and obtaining position 
information, identification information, and difference information corresponding 
to the detected result, the position information representing the position of a 
container entry in the hierarchical structure of the directory, the identification 
information identifying a leaf entry corresponding to the hierarchical structure of 
the directory, the difference information representing the difference of leaf 
entries" at Col. 5 line 25 to Col. 6 line 60;

• "transmitting means for transmitting the position information, the identification 
information, and the difference information" at Col. 6 lines 30-65, Col. 14 lines 
45-55 and Fig. 10; 
• "receiving means for receiving the position information, the identification
information, and the difference information transmitted by said transmitting 
means" at Col. 6 lines 30-65, Col. 14 lines 45-55 and Fig. 10; 

• "obtaining means for selectively obtaining the identification information of a leaf 
entry in the immediately lower hierarchical level of a container entry represented 
by the position information corresponding to selection information designated 
corresponding to the position information" at Col. 15 lines 5-20; 

• "second managing means for managing the hierarchical structure of the directory 
formed with the position information corresponding to the selection information 
and with the identification information that is selectively obtained" at Col. 15 lines 
10-25. 

Conclusion 

9. The prior art made of record, listed on form PTO-892, and not relied upon, if any, 
is considered pertinent to applicant's disclosure. 

If a reference indicated as being mailed on PTO-FORM 892 has not been 
enclosed in this action, please contact Lisa Craney whose telephone number is (703) 
305-9601 for faster service. 

Any inquiry concerning this communication or earlier communications from the
examiner should be directed to Khanh B. Pham whose telephone number is (703)
308-7299. The examiner can normally be reached on Monday through Friday 7:30am to
4:00pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's
supervisor, John E Breene, can be reached on (703) 305-9790. The fax phone number
for the organization where this application or proceeding is assigned is (703) 872-9306.

Any inquiry of a general nature or relating to the status of this application or
proceeding should be directed to the receptionist whose telephone number is (703)
746-7240.

Khanh B. Pham
Examiner
Art Unit 2177

KBP

September 22, 2003





Notice of References Cited

Application/Control No.: 09/605,733
Applicant(s)/Patent Under Reexamination: YAMAGISHI ET AL.
Examiner: Khanh B. Pham
Art Unit: 2177
Page 1 of 1

U.S. PATENT DOCUMENTS

*  Ref  Document Number                  Date      Name                Classification
        (Country Code-Number-Kind Code)  (MM-YYYY)
   A    US-5,956,718 A                   09-1999   Prasad et al.       707/10
   B    US-6,052,724 A                   04-2000   Willie et al.       709/223
   C    [illegible] A                    [illegible]  [illegible]      707/202
   D    [illegible]                      [illegible]  Bunnell, Karl Lee  707/102
   E    [illegible] B1                   [illegible]  Jeffords et al.  709/316
   F    [illegible] B1                   [illegible]  Dennis et al.    707/3
   G    [illegible] B1                   [illegible]  Prasad et al.    707/10
   H    US-6,564,370 B1                  05-2003   Hunt, Gary Thomas   717/122
   I-M  (blank)

FOREIGN PATENT DOCUMENTS

*  Ref  Document Number                  Date       Country  Name  Classification
        (Country Code-Number-Kind Code)  (MM-YYYY)
   N-T  (blank)

NON-PATENT DOCUMENTS

*  Ref  (Include as applicable: Author, Title, Date, Publisher, Edition or Volume, Pertinent Pages)
   U    Daniels et al., "An Algorithm for Replicated Directory", Annual ACM Symposium on
        Principles of Distributed Computing, Canada, 1983, Pages: 104-113.
   V    Balasubramaniam et al., "What is a File Synchronizer", International Conference on
        Mobile Computing and Networking, 1998, Pages: 98-108.
   W-X  (blank)

*A copy of this reference is not being furnished with this Office action. (See MPEP § 707.05(a).)
Dates in MM-YYYY format are publication dates. Classifications may be US or foreign.

U.S. Patent and Trademark Office
PTO-892 (Rev. 01-2001)    Notice of References Cited    Part of Paper No. 5



An algorithm for replicated directories

Source: Annual ACM Symposium on Principles of Distributed Computing,
        Proceedings of the second annual ACM symposium on Principles of distributed computing,
        Montreal, Quebec, Canada
Pages: 104-113
Year of Publication: 1983
ISBN: 0-89791-110-5
Authors: Dean Daniels, Alfred Z. Spector
Sponsors: SIGOPS: ACM Special Interest Group on Operating Systems
          SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
Publisher: ACM Press, New York, NY, USA

Abstract

This paper describes a replication algorithm for directory objects based
upon Gifford's weighted voting for files. The algorithm associates a
version number with each possible key on every replica and thereby
resolves an ambiguity that arises when directory entries are not stored
in every replica. The range of keys associated with a version number
changes dynamically; but in all instances, a separate version number is
associated with each entry stored on every replica. The algorithm
exhibits favorable availability and concurrency properties. There is no
performance penalty for associating a version number with every
possible key except on Delete operations, and simulation results show
this overhead is small.

CR Categories and Subject Descriptors: C.2.4
[Computer-Communication Networks]: Distributed Systems
- Distributed applications; D.4.3 [Operating Systems]: File Systems
Management - Directory structures, Distributed file systems; D.4.5:
[...] Systems]: Systems - Distributed systems.
1 Introduction

Object replication on distributed computing systems has the goals of
increased parallelism, reduced communications costs, and increased
resilience to failures. In particular, replication can permit increased
data availability: continued access to objects despite failures of one or
more storage nodes. Unfortunately, it is difficult to achieve increased
performance and reliability while ensuring that the semantics of
replicated data objects are identical with their non-replicated
counterparts.

This paper presents a scheme for replicating directories that permits
concurrent operations and arbitrarily high data availability. The
semantics of the replicated directory are typical of directories that are
stored on a single site. Briefly, directories contain a collection of entries,
each of which contains a (key, value) pair with a unique key. The
replicated directory has operations similar to the following:
Lookup(K:Key) Returns(Boolean, Value), Insert(K:Key, V:Value),
Update(K:Key, V:Value), and Delete(K:Key). Trivial modifications of
this algorithm may be used to implement sets or similar abstractions.

The replication algorithm that we present is similar to Gifford's
weighted voting algorithm [Gifford 79, Gifford 81], and thus, has the
same performance and reliability advantages. However, unlike
Gifford's algorithm, our algorithm uses a new technique to associate a
version number with each possible key at every replica. This technique
permits concurrent operations on different entries and solves certain
problems in the implementation of the deletion operation. Unlike most
replication algorithms, which are concerned with simple objects having
only read and write operations, this algorithm uses the semantic
properties of directories, and thereby gains increased performance.

This work on replication is part of a larger research project studying
distributed systems that use a transaction facility to support operations
on shared abstract data types [Schwarz 82, Spector 83]. The replicated
directory described in this paper is an example of a distributed abstract
data type whose construction is facilitated by having a flexible
underlying transaction mechanism available. Additional components
of our research address synchronization, recovery, and communication
issues. Groups at MIT and Georgia Institute of Technology are also
investigating the wider use of transactions [Liskov 82, Weihl 83, Allchin
82, Allchin 83].

In the following section of this paper, we survey related replication
work and motivate the development of our algorithm. We then
describe the algorithm in detail and present performance data that we
obtained via simulation. Finally, we discuss additional ways to make
the replication algorithm function with greater efficiency and
concurrency.

2 Related Work and Motivation 



This section discusses the application of existing replication
algorithms to the problem of replicated directories, and informally
develops the proposed replication strategy. First, unanimous update
and primary/secondary copy strategies are briefly discussed. (See
Lindsay for a brief survey of these strategies [Lindsay 79].) Then,
weighted voting is considered and adapted for use in directory
replication.

In the unanimous update strategy, any update operation must be
done on all replicas, but reads may be directed to any replica. This
replication strategy guarantees data consistency if the systems storing
each replica guarantee data consistency locally. Unfortunately, the
availability for updates of any object is poor when large numbers of
replicas are used. There have been attempts to increase update
availability by using the communication system to buffer updates to
replicas that are not available. The SDD-1 distributed database system
uses an approach like this [Rothnie 77].

In replication strategies based on keeping primary and secondary
copies of data, the primary copy receives all updates and then relays the
updates to secondary copies. An inquiry may be sent to a secondary
copy, but the result may not reflect the most current updates. Because
responses to inquiries might not reflect recent updates, it is difficult for
a primary/secondary copy replication strategy to duplicate the
semantics of a non-replicated object. Techniques for lessening this
problem have been developed; for example, the Locus system uses a
synchronization site [Popek 81].

Gifford designed a strategy for replication of files, which is based on
a scheme called weighted voting [Gifford 79, Gifford 81]. This
algorithm assigns some number of votes and a version number to each
representative (or replica) of a replicated file suite. Write operations
modify each representative in a write quorum of W votes and increment
the version number of each representative in the quorum. Read
operations read from each representative in a read quorum of R votes
and return data from the representative with the largest version
number. The sizes of the read and write quorums are chosen so that
R + W is greater than the sum of votes assigned to all representatives.
Thus, every read quorum has a non-null intersection with every write
quorum and each inquiry is guaranteed to access at least one current
copy of the data.

Weighted voting has several attributes that make it particularly
appealing as the basis for the design of a replicated directory. First, the
sizes of the read and write quorums may be varied to adjust the relative
cost and availability of reads and writes. A unanimous update strategy
may be specified if desired. Second, representatives with zero votes
may be used as hints [Lampson 79]. Third, consistency and recovery
are mainly the responsibility of transactional storage systems, which are
assumed to hold each representative. Because concurrent operations
are synchronized by the transaction system storing each representative,
there can be considerable flexibility in the specification and
implementation of concurrency control.
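The quorum rule for weighted voting described in this section can be illustrated with a short sketch. The Python below is not taken from the paper or from Gifford's work; the names Representative, quorums_intersect, write, and read are hypothetical, and the write path simply uses one more than the largest version number found in the write quorum, which is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Representative:
    """One replica of a file suite (hypothetical structure)."""
    votes: int = 1
    version: int = 0
    data: str = ""


def quorums_intersect(reps, r, w):
    """True when R + W exceeds the total votes, so every read quorum
    shares at least one representative with every write quorum."""
    return r + w > sum(rep.votes for rep in reps)


def _votes(quorum):
    return sum(rep.votes for rep in quorum)


def write(quorum, w, data):
    """Store data on each member of a write quorum under a new version.
    Assumption: the new version is one more than the largest version
    seen in the quorum."""
    if _votes(quorum) < w:
        raise ValueError("write quorum not reached")
    new_version = max(rep.version for rep in quorum) + 1
    for rep in quorum:
        rep.version, rep.data = new_version, data


def read(quorum, r):
    """Return the data held by the highest-versioned member of a read quorum."""
    if _votes(quorum) < r:
        raise ValueError("read quorum not reached")
    return max(quorum, key=lambda rep: rep.version).data


# A 3-representative suite with R = 2 and W = 2, one vote each.
a, b, c = Representative(), Representative(), Representative()
assert quorums_intersect([a, b, c], r=2, w=2)
write([a, b], 2, "first")
write([b, c], 2, "second")
latest = read([a, c], 2)   # a is stale, c is current
```

In this run the read quorum {a, c} contains one stale and one current representative, and the version comparison selects the current one.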



While weighted voting is an appealing approach to directory
replication, the basic algorithm cannot be applied to directories
without undesirable concurrency limitations. Even though the
semantics of directory operations permit concurrent modifications to
different entries, only a single transaction could modify the directory at
any time if a directory were stored as a replicated file suite. This is
because each representative has a single version number, which causes
the serialization of operations that modify the directory.

It might seem that these concurrency limitations could be overcome 
if each entry in a directory representative were assigned a separate 
version number. However, with such an approach, representatives 
might not have a version number for an entry that is stored on other 
representatives. Because of this, it may not be possible to examine an 
arbitrary read quorum and determine whether an entry for a particular 
key exists. 

For example, consider a 3-representative directory suite having a
read quorum of 2 and a write quorum of 2; we call this a 3-2-2
directory.¹ Initially, each representative in the suite contains entries
"a" and "c", and each entry has version number 1 as in Figure 1.²
Subsequently entry "b" is inserted into representatives A and B with
version number 1 (Figure 2). If a Lookup("b") request is sent to
representatives A and C at this point, representative A will respond
with "present with version number 1", and representative C will reply


Representative A      Representative B      Representative C
Key "a", version 1    Key "a", version 1    Key "a", version 1
Key "c", version 1    Key "c", version 1    Key "c", version 1

Figure 1: A 3-2-2 Directory Suite - Initial Configuration



¹ The notation x-y-z will refer to a directory having x representatives, a read quorum of
y and a write quorum of z. For simplicity, all examples in this paper assume that each
representative is assigned one vote.

² The value field is omitted from all figures to save space.
Representative A      Representative B      Representative C
Key "a", version 1    Key "a", version 1    Key "a", version 1
Key "b", version 1    Key "b", version 1
Key "c", version 1    Key "c", version 1    Key "c", version 1

Figure 2: Directory Suite After Inserting "b"
Figure 3: Directory Suite After Deleting "b"

"not present". If entry "b" is then deleted from representatives B and
C (Figure 3), Lookup("b") requests to representatives A and C will
still elicit "present with version number 1" and "not present"
responses. Thus, if a directory representative fails to associate a version
number with keys for which it has no entry, the responses from a read
quorum may not be sufficient to determine if there is an entry in the
directory suite for a given key.
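The ambiguity in this example can be reproduced with a few lines of Python. This is only an illustrative sketch: each representative is modeled as a plain dictionary from key to version number, and the helper names lookup, initial_suite, and ask are hypothetical.

```python
def lookup(rep, key):
    """Reply of one representative that keeps versions only for stored entries."""
    if key in rep:
        return ("present", rep[key])
    return ("not present", None)   # no version number is available


def initial_suite():
    """Figure 1: every representative maps key -> version for "a" and "c"."""
    return {name: {"a": 1, "c": 1} for name in "ABC"}


# Figure 2: "b" inserted into the write quorum {A, B}.
after_insert = initial_suite()
for name in "AB":
    after_insert[name]["b"] = 1

# Figure 3: "b" then deleted from the write quorum {B, C}.
after_delete = initial_suite()
for name in "AB":
    after_delete[name]["b"] = 1
for name in "BC":
    after_delete[name].pop("b", None)


def ask(suite, quorum, key):
    """Collect the replies of the named representatives."""
    return [lookup(suite[name], key) for name in quorum]


replies_while_present = ask(after_insert, "AC", "b")
replies_after_delete = ask(after_delete, "AC", "b")
```

The read quorum {A, C} returns the same pair of replies before and after the deletion, so a client cannot tell the two states apart.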

The ambiguity demonstrated above is associated with deletions and
will not occur if deletions are not permitted. Entries could be updated
to indicate that they are "deleted", but the space occupied by "deleted"
entries could not easily be reclaimed. An alternative strategy is to
eliminate the ambiguity by consulting an additional representative
whenever one representative replies "present with version number x"
and another representative replies "not present." This approach may be
applied to any directory suite configuration, but it results in reduced
availability.

As has been demonstrated, associating a version number only with 
existing entries fails to capture important information about the version 
numbers of keys for which there are no entries. If, however, a single
version number per representative is used, concurrency is limited. A 
solution is to partition the space of possible keys and to associate a 
separate version number with each partition. 

A directory could be partitioned by placing each key for which there
is an entry in a separate partition, and maintaining a single additional
partition for all keys that do not have entries. Such a directory keeps a
version number with each entry and keeps an additional version
number for use with "not present" responses. Under such a
partitioning, deletions must increment the "not present" version
number. Since the "not present" version number applies to a very large
set of keys, this approach suffers from concurrency limitations that are
similar to the single version number per representative approach.
Alternatively, deletions could be implemented by marking entries to be
deleted and then performing a "garbage collection" operation
periodically. However, that operation is complex and would itself be a
concurrency bottleneck.

This paper will consider partitioning the key space into a set of 
disjoint ranges by imposing an ordering relation on the keys. The 
simplest approach is to use a static partitioning; however, the additional 
concurrency that is achieved might be less than expected. If a small 
number of ranges were used, then at most that number of transactions 
could modify a directory concurrently. Also, if transactions modify 
entries in more than one range, concurrency will be further limited. 
Even if a large number of ranges were used, an uneven distribution of
accesses could limit concurrency. 

Below, we concentrate on a technique in which the ranges of keys
associated with version numbers change dynamically. A dynamic
technique such as this might be desirable for directories having sizes or
access patterns that vary widely over time. In this dynamic approach,
each directory entry, and, consequently, its key, is in a range by itself
with its own version number. Each range of keys between directory
entries, called a gap, is a separate range with a separate version number.

Because each entry in a directory representative is in a range by itself, 
lookup operations on such entries return the version number associated 
with the entry. Lookup operations on keys not in a directory 
representative return the version number of the gap in which the key 
appears. Update operations increment the version number of the range 
containing the entry being updated; insertion operations split a gap; 



Representative A        Representative B        Representative C
Key <Low>, version 0    Key <Low>, version 0    Key <Low>, version 0
  gap version 0           gap version 0           gap version 0
Key "a", version 1      Key "a", version 1      Key "a", version 1
  gap version 0           gap version 0           gap version 0
Key "b", version 1      Key "b", version 1
  gap version 0           gap version 0
Key "c", version 1      Key "c", version 1      Key "c", version 1
  gap version 0           gap version 0           gap version 0
Key <High>, version 0   Key <High>, version 0   Key <High>, version 0

Figure 4: Directory Suite After Inserting "b"
and deletions coalesce the gaps and entries in a range of keys into a
single gap. For example, using this approach, entry "b" would be
inserted into representatives A and B (of Figure 1) with version number
1, which is one greater than the version number of the gap between "a"
and "c" (Figure 4).³ If a Lookup("b") request were sent to
representatives A and C at this point, representative A will respond
with "present with version number 1," and representative C will reply
"not present with version number 0." Using these responses, a client
may determine that there is an entry for "b" since that response has the
larger version number. If "b" is subsequently deleted from
representatives B and C, then the two gaps on either side of "b" on
representative B are coalesced; then on both representatives, the gap
between "a" and "c" is assigned version number 2 (Figure 5).

Figure 5: Directory Suite After Deleting "b"

The following section discusses this replication algorithm in more 
detail. 
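The gap-versioning idea sketched in this section can be illustrated in Python. This is a minimal sketch under stated assumptions and is not the authors' implementation: the class GapRep, the helper wrap, and the function suite_lookup are hypothetical names, keys are wrapped in tuples so that the LOW and HIGH sentinels sort below and above every real key, values are omitted as in the figures, and the caller supplies the version number for insertions and deletions.

```python
import bisect

LOW, HIGH = (0, ""), (2, "")      # sentinels below and above every real key


def wrap(key):
    """Place a real key between the LOW and HIGH sentinels."""
    return (1, key)


class GapRep:
    """Hypothetical representative: sorted keys, entry versions, and a
    version for the gap that follows each key."""

    def __init__(self, keys):
        self.keys = [LOW] + sorted(wrap(k) for k in keys) + [HIGH]
        self.entry_ver = {k: 1 for k in self.keys[1:-1]}
        self.gap_ver = {k: 0 for k in self.keys[:-1]}

    def lookup(self, key):
        """Return (present, version); absent keys report their gap's version."""
        k = wrap(key)
        if k in self.entry_ver:
            return (True, self.entry_ver[k])
        left = self.keys[bisect.bisect_left(self.keys, k) - 1]
        return (False, self.gap_ver[left])

    def insert(self, key, version):
        """Split the gap containing key; both halves keep the old gap version."""
        k = wrap(key)
        if k in self.entry_ver:
            self.entry_ver[k] = version
            return
        i = bisect.bisect_left(self.keys, k)
        self.gap_ver[k] = self.gap_ver[self.keys[i - 1]]
        self.keys.insert(i, k)
        self.entry_ver[k] = version

    def coalesce(self, low_key, high_key, version):
        """Drop every entry strictly between two stored keys and give the
        resulting single gap the supplied version."""
        lo, hi = wrap(low_key), wrap(high_key)
        i, j = self.keys.index(lo), self.keys.index(hi)
        for k in self.keys[i + 1:j]:
            del self.entry_ver[k], self.gap_ver[k]
        del self.keys[i + 1:j]
        self.gap_ver[lo] = version


def suite_lookup(quorum, key):
    """Believe the reply that carries the largest version number."""
    return max((rep.lookup(key) for rep in quorum), key=lambda r: r[1])[0]


# Replay Figures 4 and 5 on a 3-2-2 suite that starts with "a" and "c".
A, B, C = (GapRep(["a", "c"]) for _ in range(3))
A.insert("b", 1)
B.insert("b", 1)
found_after_insert = suite_lookup([A, C], "b")
B.coalesce("a", "c", 2)
C.coalesce("a", "c", 2)
found_after_delete = suite_lookup([A, C], "b")
```

Because the gap on C carries version 2 after the deletion, its "not present" reply outranks the stale version-1 entry still held by A, which resolves the ambiguity of the per-entry scheme.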



3 Details of the Algorithm 

This section presents the details of the approach to directory
replication sketched in the previous section. The descriptions given
here are illustrated with program text in a Pascal-like language that
allows procedures to return multiple values and includes a remote
procedure call primitive. Remote procedure calls are written as
"Send(<procedure invocation>) to(<object instance>)" and are assumed
to return values in the same fashion as a normal procedure invocation.
These remote procedure calls are similar in semantics to those of
ARGUS [Liskov 82], except that error responses, such as timeouts, are
not considered in these examples. Clarity is emphasized over
performance in these descriptions and an inventive reader will find
many improvements.

³ The directory representatives in Figure 4 contain the special keys LOW and HIGH,
which delimit the first and last gaps in the representatives.

There are three parts to the descriptions given here. First, the
operations on directory representatives are identified. Second, the
operations on directory suites are described and illustrated and finally,
some correctness arguments are given.

3.1 Directory Representatives 

In a replicated directory, each directory representative is an instance
of an abstract object that stores one copy of the directory data.
Arbitrarily complex atomic transactions may be constructed using the
basic operations provided by directory representatives. Thus, directory
representatives must synchronize concurrent operations performed by
different transactions and store critical information in a fashion that
recovers from failures. Gifford's weighted voting algorithm makes
similar requirements on its file representatives.

Every instance of a directory representative contains two
distinguished keys: HIGH and LOW. HIGH is greater than any key
that can be inserted into the representative, and LOW is less than any
key. HIGH and LOW simplify the directory suite delete operation by
ensuring that all keys have a real successor and real predecessor in the
directory. Real predecessor and real successor have an intuitive
meaning, but are defined precisely in Section 3.2.

Directory representatives provide typical directory primitives:
DirRepLookup and DirRepInsert. In addition, directory representatives
provide specialized operations that are used to implement the directory
suite deletion operation: DirRepPredecessor, DirRepSuccessor, and
DirRepCoalesce. DirRepPredecessor returns the key and version
number of the entry in the representative that is the immediate
predecessor of the key passed as an argument; it also returns the version
number of the gap between the keys. DirRepSuccessor is analogous to
DirRepPredecessor. Deletions are performed on a directory
representative using the DirRepCoalesce operation, which deletes any
entries appearing in a range between two specified entries and assigns a
single version number to the resultant gap. Thus, DirRepCoalesce
coalesces a range of keys into a single gap. Figure 6 gives sample
procedure headings for each of these operations.

Each directory representative must synchronize the concurrent
operations of different transactions. While this might be accomplished
in many ways, the discussion presented here will assume that
type-specific locking is used [Schwarz 82]. In type-specific locking, every
operation on an abstract object acquires a lock that is a member of the
set of locks associated with that object. A lock compatibility relation is
used to determine whether a lock may be acquired by a particular
transaction.



DirRepLookup(x: key)
    Returns(boolean, version, value);
    { If there is an entry for x, return TRUE, the
      version number of the entry, and its
      value; otherwise return FALSE and the
      version number of the gap containing x.
      Locks RepLookup(x,x). }

DirRepPredecessor(x: key)
    Returns(key, version, version);
    { Returns the key and version number of the
      entry with the largest key less than x.
      Also returns the version number of the
      gap between x and its predecessor. There
      need not be an entry for x.
      Locks RepLookup(y,x) where y is the key
      returned. }

DirRepSuccessor(x: key)
    Returns(key, version, version);
    { Returns the key and version number of the
      entry with the smallest key greater than x.
      Also returns the version number of the gap
      between x and its successor. There need
      not be an entry for x.
      Locks RepLookup(x,y) where y is the key
      returned. }

DirRepInsert(x: key, v: version, z: value);
    { Creates an entry for key x with version
      number v and value z. Updates the entry
      for key x if one already exists.
      Locks RepModify(x,x). }

DirRepCoalesce(l: key, h: key, v: version);
    { Deletes entries for any keys between (but
      not including) l and h. The resulting gap
      is assigned version number v. An error is
      indicated if entries do not exist for keys
      l and h.
      Locks RepModify(l,h). }

Figure 6: Directory Representative Operations 
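
The operations in Figure 6 can be made concrete with a small in-memory model. The following Python fragment is only a sketch under stated assumptions, not the authors' implementation: the class name DirRep, the tuple encoding of LOW and HIGH, the helper key(), and the rule that a newly inserted key splits its gap into two gaps carrying the old gap's version are all illustrative choices. Locking and failure recovery are omitted entirely.

```python
import bisect

# Sentinel keys (assumed encoding): tuples sort LOW < (1, any string) < HIGH.
LOW, HIGH = (0, ""), (2, "")

def key(s):
    """Wrap a user-supplied string so it sorts between LOW and HIGH."""
    return (1, s)

class DirRep:
    """One directory representative: sorted entries plus a version per gap."""

    def __init__(self):
        self.keys = [LOW, HIGH]                            # kept sorted
        self.entries = {LOW: (0, None), HIGH: (0, None)}   # key -> (version, value)
        self.gaps = {LOW: 0}   # lower bounding key -> version of the gap above it

    def lookup(self, x):
        if x in self.entries:
            version, value = self.entries[x]
            return True, version, value
        below = self.keys[bisect.bisect_left(self.keys, x) - 1]
        return False, self.gaps[below], None

    def predecessor(self, x):
        """Entry with the largest key less than x (x must be above LOW)."""
        pred = self.keys[bisect.bisect_left(self.keys, x) - 1]
        return pred, self.entries[pred][0], self.gaps[pred]

    def successor(self, x):
        """Entry with the smallest key greater than x (x must be below HIGH)."""
        i = bisect.bisect_right(self.keys, x)
        succ = self.keys[i]
        return succ, self.entries[succ][0], self.gaps[self.keys[i - 1]]

    def insert(self, x, version, value):
        if x not in self.entries:
            i = bisect.bisect_left(self.keys, x)
            # Assumption: both halves of the split gap keep the old gap version.
            self.gaps[x] = self.gaps[self.keys[i - 1]]
            self.keys.insert(i, x)
        self.entries[x] = (version, value)

    def coalesce(self, low, high, version):
        if low not in self.entries or high not in self.entries:
            raise KeyError("both bounding entries must exist")
        if not low < high:
            raise ValueError("low must be less than high")
        i, j = self.keys.index(low), self.keys.index(high)
        for k in self.keys[i + 1:j]:
            del self.entries[k]
            del self.gaps[k]
        del self.keys[i + 1:j]
        self.gaps[low] = version

# Representative A of Figure 10, then a coalesce from LOW to "bb".
# The gap version 4 is an illustrative value, not one taken from the paper.
rep_a = DirRep()
for name, ver in [("a", 1), ("b", 1), ("bb", 3), ("c", 1)]:
    rep_a.insert(key(name), ver, name.upper())
rep_a.coalesce(LOW, key("bb"), 4)
```

After the coalesce, a lookup of "a" or of the ghost "b" on this representative reports an absent key with the new gap version, while "bb" and "c" are unaffected.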



The lock classes used in synchronizing a directory representative are
the obvious analogs of the lock classes for a single-copy directory (given
by Schwarz [Schwarz 82]). However, instead of locking single keys, the
lock classes are generalized to lock an entire range of keys, and the
granting of a lock depends on whether a range of keys to be locked
intersects the range of keys already locked by some other transaction.
Inquiry operations (DirRepLookup, DirRepPredecessor, and
DirRepSuccessor) set RepLookup(σ,τ) locks, where the range of keys
explicitly or implicitly accessed by the operation is those keys greater
than or equal to σ and less than or equal to τ. A RepModify(σ,τ) lock
is obtained on the keys of entries modified by the DirRepInsert and
DirRepCoalesce operations.

The lock compatibility relation for operations on directory
representatives is illustrated in Figure 7. In the figure, [σ..τ] and
[σ'..τ'] are arbitrary non-intersecting ranges of keys, and [σ..τ] and
[σ''..τ''] are arbitrary intersecting key ranges. Locks are compatible
except that a RepModify lock may not specify a range which intersects
the range already specified by another RepModify lock, a RepModify
lock may not specify a range which intersects the range already
specified by a RepLookup lock, and a RepLookup lock may not specify
a range which intersects a range already specified by a RepModify lock.
For example, the compatibility relation specifies that a transaction may
not be granted a RepModify(σ'',τ'') lock if another transaction already
holds a RepModify(σ,τ) lock.



                             Lock Held
Lock Requested        None   RepModify(σ,τ)   RepLookup(σ,τ)

RepModify(σ'',τ'')    OK     No               No
RepModify(σ',τ')      OK     OK               OK
RepLookup(σ'',τ'')    OK     No               OK
RepLookup(σ',τ')      OK     OK               OK

Note: [σ..τ] intersects [σ''..τ''] and [σ'..τ'] does not intersect [σ..τ]

Figure 7: Compatibility of Directory Representative Lock Classes

As specified, the lock compatibility relation is sufficiently strong to
guarantee that the actions of transactions operating on a directory
representative are serializable [Traiger 82], provided that two-phase
locking is used. This form of synchronization simplifies correctness
arguments given in Section 3.3.
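
The compatibility rule for range locks reduces to two tests: do the ranges intersect, and is either lock a modify lock? The Python fragment below is a minimal sketch of that rule; the function names and the encoding of a lock as a (class, (low, high)) pair are assumptions made for illustration.

```python
def intersects(r1, r2):
    """Closed key ranges (lo, hi) intersect when neither lies wholly below the other."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]

def compatible(requested, held):
    """Each lock is (lock_class, (lo, hi)) with lock_class 'lookup' or 'modify'."""
    (req_class, req_range), (held_class, held_range) = requested, held
    if not intersects(req_range, held_range):
        return True                    # disjoint ranges never conflict
    # Intersecting ranges are compatible only when both locks are lookups.
    return req_class == "lookup" and held_class == "lookup"
```

For example, a request for a modify lock on ("a", "c") is refused while another transaction holds a lookup lock on ("b", "d"), but two lookup locks on those ranges can be held together.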

3.2 Directory Suites 

Directory suites consist of a set of directory representatives, a
distribution of votes, and the read and write quorum sizes R and
W. Operations on directory representatives are combined to implement
a replicated directory based on the weighted voting rules described in
Section 2. A directory suite implements the operations
DirSuiteLookup, DirSuiteInsert, DirSuiteUpdate, and DirSuiteDelete.

The DirSuiteLookup operation sends DirRepLookup requests to a
read quorum of representatives and returns the results⁴ of the reply
with the largest version number. Code for this operation is given in
Figure 8.

Directory suite modification operations must ensure that the version
number of the modified entry is higher than any version number that
had been previously associated with the entry's key. In addition, the
DirSuiteDelete operation must exercise care so that it does not
inadvertently give a higher version number to non-current data.

⁴ Figure 8 shows DirSuiteLookup returning a version number as well as a boolean and
the value of the entry. The version number is used by the procedures RealPredecessor,
DirSuiteInsert, and DirSuiteModify. A user would ignore this number.

DirSuiteLookup(x: key)
    Returns(boolean, version, value);
var
    { read quorum has R members }
    quorum: array[1..R] of DirRep;
    v, bestv: version;
    val, bestval: value;
    isin, bestisin: boolean;
    i: integer;
begin
    { collect a read quorum for this operation }
    quorum := CollectReadQuorum();
    bestv := LowestVersion; { a constant }
    { send inquiries to each quorum member }
    for i := 1 to R do
    begin
        isin, v, val := Send(DirRepLookup(x))
                            to(quorum[i]);
        if v > bestv then
        begin
            bestv := v;
            bestval := val;
            bestisin := isin;
        end;
    end; { of for i }
    return(bestisin, bestv, bestval);
end; { of DirSuiteLookup }

Figure 8: DirSuiteLookup Operation

DirSuiteInsert(x: key, z: value);
var
    { write quorum has W members }
    quorum: array[1..W] of DirRep;
    i: integer;
    k: key;
    ver: version;
    val: value;
    isin: boolean;
begin
    { first, look up the key to find the }
    { current version number }
    isin, ver, val := DirSuiteLookup(x);
    { val ignored }
    if isin then ReportError();
    { find a write quorum }
    quorum := CollectWriteQuorum();
    { The new entry's version number must be }
    { higher than its previous version number }
    { as returned by the DirSuiteLookup call }
    ver := ver + 1;
    { insert the entry in each quorum member }
    for i := 1 to W do
        Send(DirRepInsert(x, ver, z))
            to(quorum[i]);
end; { of DirSuiteInsert }

Figure 9: DirSuiteInsert Operation

The DirSuiteInsert operation is quite simple. DirSuiteInsert first
looks up the key to be inserted in a read quorum and uses one greater
than the highest version number as the version number for the new
entry. The entry is then inserted in a write quorum of representatives.
Figure 9 illustrates this operation. The DirSuiteUpdate operation is
analogous.
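
The quorum read and quorum write performed by DirSuiteLookup and DirSuiteInsert can be tried out on in-memory objects. The Python fragment below is a sketch under simplifying assumptions: MiniRep is a hypothetical stand-in for a representative that keeps a single version for every gap, quorums are passed in by the caller rather than collected, and LOWEST_VERSION is an assumed constant below every real version number.

```python
LOWEST_VERSION = -1   # assumed constant below every real version number

class MiniRep:
    """Simplified representative: one version shared by every gap (an assumption)."""

    def __init__(self):
        self.entries = {}       # key -> (version, value)
        self.gap_version = 0

    def lookup(self, x):
        if x in self.entries:
            version, value = self.entries[x]
            return True, version, value
        return False, self.gap_version, None

    def insert(self, x, version, value):
        self.entries[x] = (version, value)

def suite_lookup(read_quorum, x):
    """Return the reply carrying the largest version number."""
    best = (False, LOWEST_VERSION, None)
    for rep in read_quorum:
        reply = rep.lookup(x)
        if reply[1] > best[1]:
            best = reply
    return best

def suite_insert(read_quorum, write_quorum, x, value):
    """Insert x with a version one higher than the current version seen by a read quorum."""
    is_in, version, _ = suite_lookup(read_quorum, x)
    if is_in:
        raise KeyError(f"{x!r} is already in the directory")
    for rep in write_quorum:
        rep.insert(x, version + 1, value)

# A 3-2-2 suite: three representatives, R = W = 2.
a, b, c = MiniRep(), MiniRep(), MiniRep()
suite_insert([a, b], [b, c], "bb", "payload")
```

Representative a never receives the insert, yet any read quorum that includes it still finds "bb", because the other quorum member answers with the higher version number.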



DirSuiteDelete must delete an entry from a write quorum by
coalescing a range of keys that includes the entry to be deleted and
assigning a higher version number to the resulting gaps. To avoid
assigning higher version numbers to data that is not current, the range
to be coalesced may not contain directory suite entries other than the
one to be deleted. To possess this property, the range must extend from
the real predecessor of the key to be deleted to its real successor. The
real predecessor of a key x is the entry with the largest key less than x
that appears in a write quorum of representatives. The real successor of
a key is defined similarly.

Locating the real predecessor and real successor of an entry that is to
be deleted is complex. There may be ghosts of entries located between
the deleted key and its real predecessor or real successor. A ghost is
defined as an entry for a key that is no longer present in the directory
suite. In addition, the real predecessor or real successor of a key might
not be present in some members of the write quorum.



Representative A       Representative B       Representative C

<LOW>   version 0      <LOW>   version 0      <LOW>   version 0
  gap   version 0        gap   version 0        gap   version 0
"a"     version 1      "a"     version 1      "a"     version 1
  gap   version 0        gap   version 2        gap   version 2
"b"     version 1      "bb"    version 3      "c"     version 1
  gap   version 0        gap   version 2        gap   version 0
"bb"    version 3      "c"     version 1      <HIGH>  version 0
  gap   version 0        gap   version 0
"c"     version 1      <HIGH>  version 0
  gap   version 0
<HIGH>  version 0

Figure 10: Directory Suite from Figure 5 After Inserting "bb"

These problems are illustrated in Figure 10. In this figure, the real
successor of the entry "a" is the entry "bb". However, "bb" does not
appear in representative C, and the ghost of entry "b" appears between
"a" and "bb" in representative A. To delete "a" from representatives A
and C, the real successor, "bb", must first be located and then copied to
representative C. The coalescing of the range from LOW to "bb"
eliminates the ghost of entry "b" from representative A, as shown in
Figure 11.

Figure 11: Directory Suite from Figure 10 After Deleting "a"

A straightforward implementation of the procedure RealPredecessor,
which locates the real predecessor of a key, is shown in Figure 12.
Because of ghost entries, this procedure may have to examine many
keys before finding the real predecessor. However, measurements
reported in Section 4 indicate that this is not a problem in practice. The
DirSuiteDelete operation uses this procedure and the analogous
procedure RealSuccessor. DirSuiteDelete locates the real successor
and real predecessor of an entry to be deleted, and inserts entries for the
real successor and real predecessor into any member of the write
quorum where they do not appear. It then determines the version
number to be assigned to the new gap and coalesces the range in each
member of the write quorum. DirSuiteDelete is illustrated in Figure
13.

3.3 Correctness Arguments 

The correctness of a directory suite's operations depends on
DirSuiteLookup always returning current information about a key.
Because every read quorum intersects every write quorum,
DirSuiteLookup will return current information as long as that
information has a version number greater than that of any non-current
information and as long as there are no concurrency anomalies. These
correctness conditions are the same as those required for Gifford's file
replication algorithm.

Two-phase locking and the lock compatibility matrices specified in
Section 3.1 are strong enough to guarantee the serializability of
transactions at any single representative. Traiger et al. [Traiger 82] have
shown that if all nodes participating in distributed transaction execution
follow two-phase locking protocols that guarantee the serializability of
transactions at individual nodes, then the resulting global schedule is
equivalent to some serial schedule of transactions.

RealPredecessor(x: key)
    Returns(key, value, version, version);
    { returns key, value, and version number }
    { of x's real predecessor, and the largest }
    { gap version encountered while searching }
var
    { read quorum has R members }
    quorum: array[1..R] of DirRep;
    pred, k, pk: key;
    pver, tv, v, vt, maxv: version;
    pvalue: value;
    i: integer;
    isin: boolean;
begin
    { collect a read quorum }
    quorum := CollectReadQuorum();
    k := x;
    isin := false;
    maxv := LowestVersion; { a constant }
    while not isin do
    begin
        pred := LowestKey; { a constant }
        for i := 1 to R do
        begin
            pk, tv, v := Send(DirRepPredecessor(k))
                            to(quorum[i]); { tv ignored }
            pred := Max(pk, pred);
            maxv := Max(v, maxv);
        end; { of for i }
        isin, pver, pvalue := DirSuiteLookup(pred);
        if not isin then
            k := pred;
    end; { of while do }
    Return(pred, pvalue, pver, maxv);
end; { of RealPredecessor }

Figure 12: RealPredecessor Operation
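
The search performed by RealPredecessor, including the way it steps over ghost entries, can be traced on the data of Figure 10. The Python fragment below is a sketch rather than the authors' code: the class Rep, the string sentinels LOW and HIGH, the placeholder values "A", "B", "BB", and "C", and the choice of representatives A and C as the read quorum are all assumptions made for illustration. Locking and remote calls are left out.

```python
import bisect

LOW, HIGH = "", "\uffff"    # assumed sentinels for plain-string keys

class Rep:
    """Read-only model of a representative built from literal data."""

    def __init__(self, entries, gaps):
        self.keys = sorted(entries)   # includes LOW and HIGH
        self.entries = entries        # key -> (version, value)
        self.gaps = gaps              # lower bounding key -> gap version

    def lookup(self, x):
        if x in self.entries:
            return (True, *self.entries[x])
        below = self.keys[bisect.bisect_left(self.keys, x) - 1]
        return False, self.gaps[below], None

    def predecessor(self, x):
        pred = self.keys[bisect.bisect_left(self.keys, x) - 1]
        return pred, self.entries[pred][0], self.gaps[pred]

def suite_lookup(quorum, x):
    """Reply with the largest version number among the quorum members."""
    return max((rep.lookup(x) for rep in quorum), key=lambda reply: reply[1])

def real_predecessor(quorum, x):
    k, max_gap = x, -1
    while True:
        pred = LOW
        for rep in quorum:
            pk, _entry_version, gap_version = rep.predecessor(k)
            pred = max(pred, pk)
            max_gap = max(max_gap, gap_version)
        is_in, version, value = suite_lookup(quorum, pred)
        if is_in:
            return pred, value, version, max_gap
        k = pred        # pred was a ghost; continue the search below it

rep_a = Rep(
    {LOW: (0, None), "a": (1, "A"), "b": (1, "B"), "bb": (3, "BB"),
     "c": (1, "C"), HIGH: (0, None)},
    {LOW: 0, "a": 0, "b": 0, "bb": 0, "c": 0},
)
rep_c = Rep(
    {LOW: (0, None), "a": (1, "A"), "c": (1, "C"), HIGH: (0, None)},
    {LOW: 0, "a": 2, "c": 0},
)
result = real_predecessor([rep_a, rep_c], "bb")
```

With this quorum the first candidate is the ghost "b" held by representative A; the suite lookup reports it absent (the gap version 2 at representative C outranks the stale entry version 1), so the search continues and settles on "a".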

The DirSuiteInsert and DirSuiteUpdate operations both set the
version number of the entries they modify to be greater than the
greatest version number previously associated with the keys of those
entries. Therefore, the current data for each key has a version number
greater than that of any non-current data for that key.

DirSuiteDelete coalesces the range between the real predecessor and
real successor of the key to be deleted. By the definition of real
predecessor and real successor, there can be no current entries (other
than the entry to be deleted) in the range to be coalesced. The operation
assigns to the coalesced range a new version number that is higher than
any version number previously associated with every key in that range.
Therefore, as with DirSuiteInsert and DirSuiteUpdate, the current data
for each key has a version number greater than that of any non-current
data for that key.






DirSuiteDelete(x: key);
var
    { write quorum has W members }
    quorum: array[1..W] of DirRep;
    i: integer;
    isin: boolean;
    succ, pred, k: key;
    pval, sval, val: value;
    pver, sver, v, ver: version;
begin
    { find a write quorum }
    quorum := CollectWriteQuorum();
    { find the successor of x }
    succ, sval, sver, ver := RealSuccessor(x);
    { find the predecessor of x }
    pred, pval, pver, v := RealPredecessor(x);
    { The version number of the coalesced gap }
    { must be higher than the maximum of any }
    { version numbers in the range coalesced }
    ver := Max(v, ver);
    isin, v, val := DirSuiteLookup(x);
    { isin, val ignored }
    ver := Max(v, ver);
    { make sure the predecessor and successor }
    { exist in every member of the quorum }
    for i := 1 to W do
    begin
        isin, v, val := Send(DirRepLookup(succ))
                            to(quorum[i]);
        { v, val ignored }
        if not isin then
            Send(DirRepInsert(succ, sver, sval))
                to(quorum[i]);
        isin, v, val := Send(DirRepLookup(pred))
                            to(quorum[i]);
        { v, val ignored }
        if not isin then
            Send(DirRepInsert(pred, pver, pval))
                to(quorum[i]);
    end; { for i }
    { coalesce the range in each member }
    for i := 1 to W do
        Send(DirRepCoalesce(pred, succ, ver+1))
            to(quorum[i]);
end; { of DirSuiteDelete }

Figure 13: DirSuiteDelete Operation

Figure 14: Simulation Results for Various Directory Suites



4 Performance Characterization 

This section presents the results of simulations of this directory 
replication strategy. There are many statistics that characterize the
performance of this algorithm, but only three were selected for the
measurements presented here.

The first statistic is labeled "Entries in ranges coalesced" and is the
average number of entries (per representative) that lie between the real
predecessor and real successor of a deleted key. This statistic counts the
entry to be deleted, if it appears in a representative, and any ghosts that
may be in the range to be coalesced. Entries for the real predecessors
and real successors are not included. This statistic reflects the number
of entries that must be examined when the DirSuiteDelete operation is
locating the real predecessor and real successor of an entry.



The second and third statistics, labeled "Insertions while coalescing"
and "Deletions while coalescing," are the average numbers of insertions
and extra deletions (per suite) performed during each DirSuiteDelete
operation. The insertion statistic counts the number of real
predecessors and real successors that must be inserted on
representatives, and the deletion statistic counts the number of ghost
entries that must be deleted. These statistics reflect the extra work done
by DirSuiteDelete in addition to the work that would be done by the
deletion operation of a unanimous update strategy having the number
of replicas in a write quorum.

Figure 14 shows the average results of simulations using directory 
sizes of approximately one hundred entries with varying numbers of
directory representatives and varying sizes of read and write quorums. 
The duration of each simulation was ten thousand operations, and the 
members of quorums and the keys to insert, update, or delete were 
selected randomly from a uniform distribution. 

More detailed results for 3-2-2 directories with one hundred, one
thousand, and ten thousand entries are shown in Figure 15. The
duration of each of these simulations was one hundred thousand
operations. The maximums and standard deviations that are shown
indicate the statistics do not vary significantly with directory size.⁵

The measurements of the first statistic indicate that the real
predecessor and real successor of a key to be deleted will be located
quickly if the simulation assumptions hold. For instance, if each
member of a read quorum sends the results of three successive
DirRepPredecessor and DirRepSuccessor operations in a single
message, the real predecessor and real successor will often be located
using one remote procedure call to each member of the quorum. The
results for the second and third statistics indicate that the weighted
voting algorithm does little extra work during deletions, compared with
a unanimous update strategy.

⁵ We believe that the statistics for the ten thousand entry directory do not reflect steady
state behavior.

                               100 Entries          1000 Entries         10000 Entries
                               Avg   Max  Std Dev   Avg   Max  Std Dev   Avg   Max  Std Dev
Entries in ranges coalesced    1.33    9   0.87     1.32   12   0.86     1.20    9   0.76
Deletions while coalescing     0.88    8   1.05     0.87   11   1.04     0.67    9   0.90
Insertions while coalescing    0.44    2   0.59     0.45    2   0.59     0.53    2   0.64

Figure 15: Detailed Simulation Results for Three 3-2-2 Directory Suites

5 Discussion 

Though the previous sections motivate and describe the basic
replication algorithm, there are many performance issues worthy of
mention. First, it is interesting to note that if the memberships of write
quorums change infrequently, coalescing during deletions will not be
costly. Thus, the statistics presented in the previous section are worse
than could be achieved, because quorum members were selected
randomly. In some ways, the algorithm behaves similarly to a moving
primary update strategy [Alsberg 76] when write quorums change
infrequently.

If transactions that operate on a directory exhibit locality of reference
with respect to keys, quorums can be chosen that permit reads to be
done locally and non-local writes to be distributed among all the
non-local representatives.⁶ For example, consider a 4-2-3 directory suite
with key values in the range of 1 to 100, and locality such that
transactions of Type A operate on entries having keys 1 to 50, and
transactions of Type B operate on entries having keys 51 to 100. We
assume that representatives A1 and A2 are local to transactions of Type
A and representatives B1 and B2 are local to transactions of Type B. As
shown in Figure 16, Type A transactions read from representatives A1
and A2 and direct their updates to A1, A2, and either B1 or B2.
Transactions of Type B behave similarly. In this example, all inquiries
can be done locally and the non-local write that is required for
modification operations is evenly distributed among the remote
representatives.

Figure 16: A 4-2-3 Directory Suite Partitioned for Locality
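
The locality arrangement works only because every read quorum still intersects every write quorum. The Python fragment below checks this by enumeration. It is a sketch that assumes "4-2-3" denotes four representatives with one vote each, a read quorum of two, and a write quorum of three; the representative names follow the example above.

```python
from itertools import combinations

reps = ["A1", "A2", "B1", "B2"]
R, W = 2, 3          # 4-2-3 suite, assuming one vote per representative

read_quorums = [set(q) for q in combinations(reps, R)]
write_quorums = [set(q) for q in combinations(reps, W)]

# Every possible read quorum shares a member with every possible write quorum.
all_intersect = all(r & w for r in read_quorums for w in write_quorums)

# The locality policy from the example: local reads, one non-local write.
type_a_reads = {"A1", "A2"}
type_a_writes = [{"A1", "A2", "B1"}, {"A1", "A2", "B2"}]
type_b_reads = {"B1", "B2"}
type_b_writes = [{"B1", "B2", "A1"}, {"B1", "B2", "A2"}]

# A local read by one transaction type must still see writes of the other type.
local_reads_see_remote_writes = (
    all(type_a_reads & w for w in type_b_writes)
    and all(type_b_reads & w for w in type_a_writes)
)
```

Both checks succeed because R + W exceeds the number of votes, which is the usual weighted voting condition.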

With respect to the implementation of the replication algorithm, the
sketches we have provided are pedagogically sound, but not the most
efficient. Locking rules can be modified to permit greater concurrency
without sacrificing serializability. Additionally, inter-representative
message traffic can be reduced by combining certain remote procedure
calls. We envision that directories could be represented as
B-trees [Comer 79]. Version numbers for gaps could be stored in fields
in their bounding entries. For some applications, version numbers
containing 48 or more bits may be required to prevent version numbers
from cycling.
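
A rough calculation shows why the width of the version number matters. The Python fragment below is illustrative arithmetic only: the rate of one thousand version increments per second is a hypothetical figure chosen for the example, not a measurement from the paper.

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

def seconds_until_wrap(bits, increments_per_second):
    """Time for a counter of the given width to run through all its values."""
    return 2 ** bits / increments_per_second

rate = 1000    # hypothetical increments per second

days_32 = seconds_until_wrap(32, rate) / SECONDS_PER_DAY
years_48 = seconds_until_wrap(48, rate) / SECONDS_PER_YEAR
```

At this assumed rate a 32-bit version number would cycle in under two months, while a 48-bit version number would last for thousands of years.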

The performance characterizations presented in this paper are based
on simulations; however, initial work on an analytical treatment
indicates that we can obtain similar results from simple analytic models.
Further simulations and practical experience are needed in order to
quantify the additional concurrency permitted by this directory
replication algorithm. We plan to implement this algorithm as well as
Gifford's weighted voting algorithm for files using a prototype
transaction-based system we are constructing on a modified version of
the Accent kernel [Rashid 81].

In summary, this paper has presented a replication algorithm for
directories that exhibits favorable performance and availability
properties. As is the case with Gifford's algorithm, the exact
configuration of suites can be tailored to provide higher or lower
availability, and higher or lower performance. This algorithm achieves
high concurrency while maintaining consistency by dynamically
partitioning the directory by range and associating a version number
with each range. Simulation results show that the extra cost associated
with maintaining the consistency of a directory replicated using our
algorithm is low.

⁶ Of course, failures that require the quorums to change will result only in a

Abstract

Mobile computing devices intended for disconnected operation,
such as laptops and personal organizers, must employ
optimistic replication strategies for user files. Unlike traditional
distributed systems, such devices do not attempt to
present a "single filesystem" semantics: users are aware that
their files are replicated, and that updates to one replica will
not be seen in another until some point of synchronization is
reached (often under the user's explicit control). A variety
of tools, collectively called file synchronizers, support this
mode of operation.

Unfortunately, present-day synchronizers seldom give the
user enough information to predict how they will behave under
all circumstances. Simple slogans like "Non-conflicting
updates are propagated to other replicas" ignore numerous
subtleties. For example: Precisely what constitutes a conflict
between updates in different replicas? What does the
synchronizer do if updates conflict? What happens when files
are renamed? What if the directory structure is reorganized
in one replica?

Our goal is to offer a simple, concrete, and precise framework
for describing the behavior of file synchronizers. To
this end, we divide the synchronization task into two conceptually
distinct phases: update detection and reconciliation.
We discuss each phase in detail and develop a straightforward
specification of each. We sketch our own prototype
implementation of these specifications and discuss how they
apply to some existing synchronization tools.

1 Introduction 

The growth of mobile computing has brought to the fore novel issues
in data management, in particular data replication under
disconnected operation. Support for replication can be
provided either transparently (with filesystem or database
support for client-side caching, transaction logs, etc.) or by
user-visible tools for explicit replica management. In this paper
we investigate one class of user-visible tools, commonly
called file synchronizers, which allow updates in different
replicas to be reconciled at the user's request.

The overall goal of a file synchronizer is easy to
state: it must detect conflicting updates and propagate
non-conflicting updates. However, a good synchronizer is quite
tricky to implement. Subtle misunderstandings of the semantics
of filesystem operations can cause data to be lost or
overwritten. Moreover, the concept of "user update" itself is
open to varying interpretations, leading to significant differences
in the results of synchronization. Unfortunately, the
documentation provided for synchronizers typically makes
it difficult to get a clear understanding of what they will do
under all circumstances: either there is no description at all
or else the description is phrased in terms of low-level mechanisms
that do not match the user's intuitive view of the
filesystem. In view of the serious damage that can be done
by a synchronizer with unintended or unexpected behavior,
we would like to establish a concise and rigorous framework
in which synchronization can be described and discussed,
using terms that both users and implementors can understand.

We concentrate on file synchronization in this paper and
only briefly touch upon the finer-grained notion of data
synchronization offered by newer tools [Puma, DDD+94, etc.],
but most of the fundamental issues are the same for file and
data synchronization. These issues are also closely related to
replication and recovery after partitions in mainstream
distributed systems [DGMS85, Kis96, GPJ93, DPS+94, etc.].
Ultimately, we may hope to extend our specification to
encompass a wider range of replication mechanisms, from data
synchronizers to distributed filesystems and databases.

In our model, a file synchronizer is invoked explicitly
by an action of the user (issuing a synchronization command,
dropping a PDA into a docking cradle, etc.). For
purposes of discussion, we identify two cleanly separated
phases of the file synchronizer's task: update detection
(i.e., recognizing where updates have been made to the separate
replicas since the last point of synchronization) and
reconciliation (combining updates to yield the new, synchronized
state of each replica).

The update detector for each replica $S$ computes a predicate
$\mathit{dirty}_S$ that summarizes the updates that have been
made to $S$. (It is allowed to err on the side of safety, indicating
possible updates where none have occurred, but all
actual updates must be reported.) The reconciler uses these
predicates to decide which replica contains the most up-to-date
copy of each file or directory. The contract between the
two components is expressed by the requirement

\[
\text{for all paths } p,\quad
\neg\mathit{dirty}_S(p) \implies
\mathit{currentContents}_S(p) = \mathit{originalContents}_S(p),
\]

which the update detector must guarantee and on which the
reconciler relies. The whole synchronization process may
then be pictured as follows:




The filesystems in both replicas start out with the same contents
$O$. Updates by the user in one or both replicas lead to
divergent states $A$ and $B$ at the time when the synchronizer
is invoked. The update detectors for the two replicas check
the current states of the filesystems (perhaps using some
information from $O$ that was stored earlier) and compute
update predicates $\mathit{dirty}_A$ and $\mathit{dirty}_B$. The reconciler uses
these predicates and the current states $A$ and $B$ to compute
new states $A'$ and $B'$, which should coincide unless there
were conflicting updates. The specification of the update
detector is a relation that must hold between $O$, $A$, and
$\mathit{dirty}_A$ and between $O$, $B$, and $\mathit{dirty}_B$; similarly, the behavior
of the reconciler is specified as a relation between $A$, $B$,
$\mathit{dirty}_A$, $\mathit{dirty}_B$, $A'$, and $B'$.
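
The contract between detector and reconciler can be checked mechanically on small examples. The Python fragment below is a sketch under stated assumptions: filesystems are modeled as dictionaries from path tuples to file contents (directories are left out), a missing path stands for undefined contents, and the function names exact_dirty and satisfies_contract are hypothetical.

```python
def exact_dirty(original, current):
    """Precise detector: a path is dirty iff its contents differ from the original."""
    paths = set(original) | set(current)
    return {p for p in paths if original.get(p) != current.get(p)}

def satisfies_contract(dirty, original, current):
    """For every path p: not dirty(p) implies current(p) == original(p)."""
    paths = set(original) | set(current)
    return all(p in dirty or original.get(p) == current.get(p) for p in paths)

O = {("d", "a"): b"f", ("d", "b"): b"g"}
A = {("d", "a"): b"f2", ("d", "b"): b"g"}    # the user edited d.a

precise = exact_dirty(O, A)
conservative = precise | {("d", "b")}         # over-reporting is allowed
unsafe = set()                                # misses the real update
```

Both the precise and the conservative predicates satisfy the requirement; the empty predicate does not, because it fails to report the change to d.a.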

The remainder of the paper is organized as follows. We
start with some preliminary definitions in Section 2. Then,
in Sections 3 and 4, we consider update detection and
reconciliation in turn. For update detection, we describe
several possible implementation strategies with different
performance characteristics. For reconciliation, we first develop a
very simple, declarative specification: a small set of natural
rules that describe the behavior of a typical synchronizer.
We then argue that these rules completely characterize the
behavior of any synchronizer satisfying them, and finally
show how they can be implemented by a straightforward
recursive algorithm. Section 5 sketches our own synchronizer
implementation, including the design choices we made in our
update detector. Section 6 discusses some existing
synchronizers and evaluates how accurately they are described by
our specification. Section 7 describes some possible
extensions.

Most of our development is independent of the features
of particular operating systems and the semantics of their
filesystem operations; the one exception is in the implementation
of update detectors (Section 3.2), which are necessarily
system-specific; our discussion there is biased toward
Unix. For the sake of brevity, proofs are omitted.

2 Basic Definitions 

To be rigorous about what a synchronizer does to the filesystems
it manipulates, the first thing we need is a precise way
of talking about the filesystems themselves.

We use the metavariables $x$ and $y$ to range over a set $\mathcal{N}$
of filenames. $\mathcal{P}$ is the set of paths, i.e., finite sequences of names
separated by dots. (The dots between path components can
be read as slashes by Unix users, backslashes by Windows
users, and colons by Mac users.) The metavariables $p$, $q$,
and $r$ range over paths. The empty path is written $\epsilon$. The
concatenation of paths $p$ and $q$ is written $p.q$. We write $|p|$
for the length of path $p$, i.e., $|\epsilon| = 0$ and $|q.x| = |q| + 1$.
We write $q \le p$ if $q$ is a prefix of $p$, i.e., if $p = q.r$ for some
path $r$. We write $q < p$ if $q$ is a proper prefix of $p$, i.e., $q \le p$
and $q \neq p$.
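
These path definitions map directly onto tuples of names. The Python fragment below is a minimal sketch; representing a path as a tuple of strings and the helper names concat, is_prefix, is_proper_prefix, and parse are assumptions made for illustration.

```python
EMPTY = ()    # the empty path

def concat(p, q):
    """p.q: the names of p followed by the names of q."""
    return p + q

def is_prefix(q, p):
    """q is a prefix of p when p == q.r for some path r."""
    return p[:len(q)] == q

def is_proper_prefix(q, p):
    """q is a proper prefix of p when it is a prefix and differs from p."""
    return is_prefix(q, p) and q != p

def parse(text):
    """Read a dotted path such as 'd.a'; the empty string is the empty path."""
    return tuple(text.split(".")) if text else EMPTY
```

The length of a path is then simply len(p), so the empty path has length 0 and appending one name adds 1.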

For the purposes of this paper, there is no need to be
specific about the contents of individual files. We simply
assume that we are given some set $\mathcal{F}$ whose elements are
the possible contents of files; for example, $\mathcal{F}$ could be the
set of all strings of bytes.

For modeling filesystems, there are many possibilities.
Most obviously, we could use the familiar recursive datatype:

\[
\mathcal{FS} = \mathcal{F} \uplus (\mathcal{N} \rightharpoonup \mathcal{FS})
\]

That is, a "filesystem node" is either a file or a directory,
where a file is some $f \in \mathcal{F}$ and a directory is a finite partial
function mapping names to nodes of the same form. For
example, the filesystem

        (root)
          |
          d
         / \
        a   b

whose root is a directory containing one subdirectory named
$d$, which contains two files $a$ (with contents $f$) and $b$ (with
contents $g$), would be represented by the function

\[
F = \{\, d \mapsto D,\ n \mapsto \bot \text{ for all other names } n \,\},
\]

where $\bot$ marks positions where $F$ is undefined and $D$ is the
function

\[
D = \{\, a \mapsto f,\ b \mapsto g,\ n \mapsto \bot \text{ for all other names } n \,\}.
\]

For purposes of specification, however, it seems more 
convenient to use a "flat" representation, where a filesys- 
tem is a function mapping whole paths to their contents. 
Formally, we say that a filesystem is an element of the set 

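\[ \begin{aligned}
TS = \{\, & S \in V \rightharpoonup (T \uplus TS) \;\mid \\
          & \forall p, q \in V.\ S(p.q) = (S(p))(q) \,\}
\end{aligned} \]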

of finite partial functions from paths to either files or sub-
filesystems. The constraint on the second line guarantees
that we only consider functions corresponding to tree
structures, i.e., ones where looking up the contents of a
composite path p.q yields the same result as first looking
up p and then looking up q in the resulting sub-filesystem
(where the application expression (S(p))(q) is defined to
yield $\bot$ if S(p) is either $\bot$ or a file).

Under this representation, the example filesystem above 
corresponds to the function 

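\[ F = \{\, \varepsilon \mapsto F,\ d \mapsto D,\ d.a \mapsto f,\ d.b \mapsto g,\ p \mapsto \bot \text{ for all other paths } p \,\}, \]

where D is the function

\[ D = \{\, \varepsilon \mapsto D,\ a \mapsto f,\ b \mapsto g,\ p \mapsto \bot \text{ for all other paths } p \,\}. \]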
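
As an informal illustration (not part of the formal development), the flat view can be sketched in Python on top of a nested encoding. The names `lookup`, `Node`, and `Path` are assumptions made for this sketch; `None` stands in for the undefined value and a tuple of names stands in for a path.

```python
from typing import Dict, Optional, Tuple, Union

# Assumed encoding: a file is a string, a directory is a dict from names to nodes.
Node = Union[str, Dict[str, "Node"]]
Path = Tuple[str, ...]  # the empty tuple () plays the role of the empty path


def lookup(fs: Optional[Node], path: Path) -> Optional[Node]:
    """Return S(p), using None for "undefined"."""
    node = fs
    for name in path:
        # Looking below a file or an undefined position yields "undefined".
        if not isinstance(node, dict) or name not in node:
            return None
        node = node[name]
    return node


# The example filesystem: a root containing d, which contains files a and b.
F: Node = {"d": {"a": "f", "b": "g"}}
D = lookup(F, ("d",))

assert lookup(F, ()) is F                 # the empty path maps to F itself
assert lookup(F, ("d", "a")) == "f"
assert lookup(F, ("d", "x")) is None      # all other paths are undefined
# Tree constraint: S(p.q) = (S(p))(q)
assert lookup(F, ("d", "b")) == lookup(D, ("b",))
```

Because the flat function is computed from a tree here, the tree constraint holds by construction; the formal definition instead imposes it as a side condition on arbitrary partial functions.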

The metavariables O, S, T, A, B, C, and D range over
filesystems.

When S is a filesystem, we write |S| for the length of the
longest path p such that $S(p) \ne \bot$. We write $children_A(p)$
for the set of names denoting immediate children of path p
in filesystem A, that is,

\[ children_A(p) = \{\, q \mid q = p.x \text{ for some } x \;\wedge\; A(q) \ne \bot \,\}. \]

We write $children_{A,B}(p)$ for $children_A(p) \cup children_B(p)$.

We write $isdir_A(p)$ to mean that p refers to a directory
(i.e., not a file and not nothing) in the filesystem A. We
write $isdir_{A,B}(p)$ iff both $isdir_A(p)$ and $isdir_B(p)$.
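
These auxiliary notions can be sketched in Python under an assumed encoding in which a file is a string, a directory is a dict, and `None` means undefined. The function names `lookup`, `isdir`, `children`, and `size` are illustrative only.

```python
def lookup(fs, path):
    """Return S(p), or None if p is undefined in S."""
    node = fs
    for name in path:
        if not isinstance(node, dict) or name not in node:
            return None
        node = node[name]
    return node


def isdir(fs, path):
    """True iff path refers to a directory (not a file, not nothing)."""
    return isinstance(lookup(fs, path), dict)


def children(fs, path):
    """The set of paths p.x that are defined immediately below p."""
    node = lookup(fs, path)
    if not isinstance(node, dict):
        return set()
    return {path + (name,) for name in node}


def size(fs):
    """|S|: the length of the longest path that is defined in S."""
    if not isinstance(fs, dict) or not fs:
        return 0
    return 1 + max(size(child) for child in fs.values())


A = {"d": {"a": "f", "b": "g"}}
B = {"d": {"b": "g", "c": "h"}, "e": "k"}

# children_{A,B}(d) is the union of the children on both sides.
both = children(A, ("d",)) | children(B, ("d",))
assert both == {("d", "a"), ("d", "b"), ("d", "c")}
# isdir_{A,B}(d) holds, but e is a file in B and absent in A.
assert isdir(A, ("d",)) and isdir(B, ("d",))
assert not isdir(B, ("e",))
assert size(A) == 2
```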

To lighten the notation in what follows, we make some 
simplifying assumptions. First, we assume that, during syn- 
chronization, the filesystems are not being modified except 
by the synchronizer itself. This means that they can be 
treated as static functions (from paths to contents), as far 
as the synchronizer is concerned. Second, we assume that, at 
the end of the previous synchronization, the two filesystems 
were identical. Third, we handle only two replicas. Finally, 
we ignore links (both hard and symbolic), file permissions, 
etc. Section 7 discusses how our development can be refined 
to relax these restrictions. 

3 Update Detection 

With these basic definitions in hand, we now turn to the 
synchronization task itself. This section focuses on update 
detection, leaving reconciliation for Section 4. 

3.1 Specification 

We first recapitulate the specification of the update detector 
sketched in the introduction: 

3.1.1 Definition: Suppose O and S are filesystems. Then
a predicate $dirty_S$ is said to (safely) estimate the updates
from O to S if $\neg dirty_S(p)$ implies O(p) = S(p), for all paths
p.

Among other things, this definition immediately tells us 
that, if a given path p is not dirty in either replica, then the 
two replicas have the same contents at p. 

3.1.2 Fact: If A, B, and O are filesystems and $dirty_A$ and
$dirty_B$ estimate the updates from O to A and O to B, then
$\neg dirty_A(p)$ and $\neg dirty_B(p)$ together imply A(p) = B(p).

One other fact will prove useful in what follows. 

3.1.3 Fact: For any filesystem S, $dirty_S$ is up-closed, i.e., if
$p \le q$ and $dirty_S(q)$, then $dirty_S(p)$. We shall use this fact
to streamline the specification of reconciliation below.
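
To make the specification concrete, here is a small Python sketch of two predicates that safely estimate updates: an exact one and the constantly true one. The encoding (strings for files, dicts for directories, `None` for undefined) and the names `exact_dirty` and `trivial_dirty` are assumptions of the sketch, not definitions from the text.

```python
def lookup(fs, path):
    """Return S(p), or None if p is undefined in S."""
    node = fs
    for name in path:
        if not isinstance(node, dict) or name not in node:
            return None
        node = node[name]
    return node


def exact_dirty(old, new):
    """Exact estimate: p is dirty iff its contents differ between O and S."""
    return lambda path: lookup(old, path) != lookup(new, path)


def trivial_dirty(path):
    """Coarsest safe estimate: every path is considered dirty."""
    return True


O = {"d": {"a": "f", "b": "g"}}
A = {"d": {"a": "f2", "b": "g"}}   # d.a modified in A
B = {"d": {"a": "f", "b": "g2"}}   # d.b modified in B
dirty_a = exact_dirty(O, A)
dirty_b = exact_dirty(O, B)
paths = [(), ("d",), ("d", "a"), ("d", "b")]

# A change below d makes d and the root dirty as well.
assert [p for p in paths if dirty_a(p)] == [(), ("d",), ("d", "a")]

# Fact 3.1.2: paths that are clean on both sides have equal contents.
for p in paths:
    if not dirty_a(p) and not dirty_b(p):
        assert lookup(A, p) == lookup(B, p)

# Fact 3.1.3: every prefix of a dirty path is dirty.
for q in paths:
    if dirty_a(q):
        assert all(dirty_a(q[:i]) for i in range(len(q) + 1))
```

The trivial predicate satisfies Definition 3.1.1 vacuously, since it never claims that a path is clean.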



3.2 Implementation Strategies 

Update detectors satisfying the above specification can be 
implemented in many different ways; this section outlines 
a few and discusses their pragmatic advantages and disad- 
vantages. The discussion is specific to Unix filesystems, but 
most of the strategies we describe would work with other 
operating systems too. 

3.2.1 Trivial Update Detector 

The simplest possible implementation is given by the con- 
stantly true predicate, which simply marks every file as dirty, 
with the result that the reconciler must then regard every 
file (except the ones that happen to be identical in the two 
filesystems) as a conflict. In some situations, this may ac- 
tually be an acceptable update detection strategy. On one 
hand, the fact that the reconciler must actually compare 
the current contents of all the files in the two filesystems 
may not be a major issue if the filesystems are small enough 
and the link between them is fast enough. On the other 
hand, the fact that all updates lead to conflicts may not be 
a problem in practice if there are only a few of them. The 
whole file synchronizer, in this case, degenerates to a kind 
of recursive remote diff. 

3.2.2 Exact Update Detector 

On the other end of the spectrum is an update detector that 
computes the dirty predicate exactly, for example by keeping 
a copy of the whole filesystem when it was last synchronized 
and comparing this state with the current one (i.e., replacing 
the remote diff in the previous case with two local diffs). 

Detecting updates exactly is expensive, both in terms of 
disk space and — more importantly — in the time that it takes 
to compute the difference of the current contents with the 
saved copies of the filesystem. On the other hand, this strategy
may perform well in situations where it is run off-line
(in the middle of the night), or where the link between the
two computers has very low bandwidth, so that minimizing
communication due to false conflicts is critical.

3.2.3 Simple Modtime Update Detector 

A much cheaper, but less accurate, update detection strat- 
egy involves using the "last modified time" provided by oper- 
ating systems like Unix. With this strategy, just one value is 
saved between synchronizations in each replica: the time of 
the previous synchronization (according to the local clock). 
To detect updates, each file's last-modified time is compared 
with this value; if it is older, then the file is not dirty. 

Unfortunately, the most naive version of this simple 
strategy turns out to be wrong. The problem is that, in 
Unix, renaming a file does not update its modtime, but 
rather updates the modtime of the directory containing the 
file: names are a property of directories, not files. For example,
suppose we have two files, a and b, and that we move
a to b (overwriting b) in one replica. If we examine just the
modtime of the path b, we will conclude that it is not dirty,
and, in the other replica, a will be deleted without b being
changed.

Similarly, it is not enough to look at a file's modtime 
and its directory's, since the directory itself could have been 
moved, leaving its modtime alone but changing its parent 
directory's modtime. To avoid the problem completely, we 
must judge a file as dirty if any of its ancestors (back to the
root of the filesystem) has a modtime more recent than the 
last synchronization. Unfortunately, this makes the simple 
modtime detector nearly useless in practice, since any up- 
date (file creation, etc.) near the root of the tree leads to 
large subtrees being marked dirty. 
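
The conservative variant can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `simple_modtime_dirty` is hypothetical, a missing path is treated as dirty, and the demonstration sets modtimes by hand with `os.utime` instead of performing a real rename.

```python
import os
import tempfile


def simple_modtime_dirty(root, rel_path, last_sync):
    """Dirty if the path, or any ancestor up to root, is not older than last_sync."""
    current = root
    to_check = [current]
    for part in rel_path.split(os.sep):
        if part:
            current = os.path.join(current, part)
            to_check.append(current)
    for path in to_check:
        try:
            mtime = os.lstat(path).st_mtime
        except OSError:
            return True  # assumption: a path that cannot be examined is dirty
        if mtime >= last_sync:
            return True
    return False


def demo():
    with tempfile.TemporaryDirectory() as root:
        d = os.path.join(root, "d")
        os.mkdir(d)
        a = os.path.join(d, "a")
        with open(a, "w") as handle:
            handle.write("f")
        # Pretend everything was last touched at time 1000 and synced at 2000.
        for path in (a, d, root):
            os.utime(path, (1000, 1000))
        before = simple_modtime_dirty(root, os.path.join("d", "a"), 2000)
        # Simulate a later change to directory d: only its modtime advances.
        os.utime(d, (3000, 3000))
        after = simple_modtime_dirty(root, os.path.join("d", "a"), 2000)
        return before, after
```

In the demonstration the file d/a itself is never touched after the synchronization time, yet it is reported dirty once its parent directory changes, which is exactly the loss of precision described above.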

3.2.4 Modtime-Inode Update Detector 

A better strategy for update detection under Unix relies on 
both modtimes and inode numbers. We remember not just 
the last synchronization time, but also the inode number 
of every file in each replica. The update detector judges 
a path as dirty if either (1) its inode number is not the 
same as the stored one or (2) its modtime is later than the 
last synchronization time. There is no need to look at the 
modtimes of any containing directories. 

For example, if we move a on top of b, as above, then
the new contents of that replica at the path b will be a file 
with a different inode number than what was there before. 
Both a and b will be marked as dirty, leading (correctly) to 
a delete and an update in the other replica. 

We have also experimented with a third variant, where 
inode numbers are stored only for directories, not for each 
individual file. This uses much less storage than remember- 
ing inode numbers for all files, but is not as accurate. Our 
own experience indicates that storing all the inode numbers 
is a better tradeoff, on the whole. 
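
A minimal Python sketch of the modtime-inode test is given below. The function names, the layout of the saved summary (a dict from relative paths to inode numbers), and the treatment of deleted paths are assumptions of the sketch, not a description of any particular implementation.

```python
import os
import tempfile


def snapshot_inodes(root):
    """Record the inode number of every file and directory below root."""
    table = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            table[os.path.relpath(full, root)] = os.lstat(full).st_ino
    return table


def modtime_inode_dirty(root, rel_path, saved_inodes, last_sync):
    """Dirty if (1) the inode differs from the saved one or (2) the modtime is later."""
    try:
        status = os.lstat(os.path.join(root, rel_path))
    except OSError:
        # Assumption: a path that has vanished is dirty iff it existed before.
        return rel_path in saved_inodes
    if saved_inodes.get(rel_path) != status.st_ino:
        return True
    return status.st_mtime > last_sync


def demo():
    with tempfile.TemporaryDirectory() as root:
        for name in ("a", "b", "c"):
            path = os.path.join(root, name)
            with open(path, "w") as handle:
                handle.write(name)
            os.utime(path, (1000, 1000))  # all files predate the sync at 2000
        saved = snapshot_inodes(root)
        # Move a on top of b; the file's own modtime does not change.
        os.replace(os.path.join(root, "a"), os.path.join(root, "b"))
        return {name: modtime_inode_dirty(root, name, saved, 2000)
                for name in ("a", "b", "c")}
```

After the move, a and b are both reported dirty while the untouched file c is not, even though no file modtime is later than the synchronization time.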

3.2.5 On-Line Update Detector 

A different kind of update detector — one that is difficult 
to implement at user level under Unix but possible under 
some other operating systems such as Windows — requires 
the ability to observe the complete trace of actions that the 
user makes to the filesystem. This detector will judge a file 
to be modified whenever the user has done anything to it 
(even if the net effect of the user's actions was to return 
the file to its original state), so it does not, in general, give 
the same results as the exact update detector. But it will 
normally get close, and may be cheaper to implement than 
the exact detector. 

On-line update detection presupposes the ability to track
all user actions that affect the filesystem; this places it closer
to the domain of traditional distributed filesystems (cf., for
example, Coda [Kis96, Kum94], Ficus [RHR+94, PJG+97],
Bayou [TTP+95, PST+97], and LittleWorks [HH95]).

4 Reconciliation 

We now turn our attention to the other major component 
of the synchronizer, the reconciler. We begin by develop- 
ing a set of simple requirements that any implementation 
should satisfy (Section 4.1). Then we give a recursive algorithm
(Section 4.2) and argue (a) that it satisfies the given
requirements, and (b) that the requirements determine its 
behavior completely, i.e., that any other synchronization al- 
gorithm that also satisfies the requirements must be behav- 
iorally indistinguishable from this one (Section 4.3). 

4.1 Specification 

Suppose that A and B are the current states of two filesystems
replicating a common directory structure, and that we
have calculated predicates $dirty_A$ and $dirty_B$, estimating the
updates in A and B since the last time they were synchronized.
Running the reconciler with these inputs will yield
new filesystem states C and D. Informally, the behavioral 
requirements on the synchronizer can be expressed by a pair 
of slogans: (1) propagate all non-conflicting updates, and (2) 
if updates conflict, do nothing. 

(Of course, an actual synchronization tool will typically 
try to do better than "do nothing" in the face of conflicting 
updates: it may, for example, apply additional heuristics 
based on the types of files involved, ask the user for advice, 
or allow manual editing on the spot. Such cleanup actions 
can be incorporated in our model by viewing them as if 
they had occurred just before the synchronizer began its real 
work.) 

We are already committed to a particular formalization 
of the notion of update (cf. Section 3): a path is updated 
in A if its value in A is different from its original value 
at the time of last synchronization. We can formalize the 
notion of conflicting updates in an equally straightforward 
way: updates in A and B are conflicting if the contents of A 
and B resulting from the updates are different. If A and B 
are both updated but their new contents happen to agree, 
these updates will be regarded as non-conflicting. (Another 
alternative would be to say that overlapping updates always 
conflict. But this will lead to more false positives in conflict 
detection.) 

Our specification of the reconciler can be stated as a set
of conditions that should hold between the starting states,
A and B, and the reconciled states, C and D, for every path
p. Informally:

1. If p is not dirty in A, then we know that the entire 
subtree rooted at p has not been changed in A, and 
any updates in the corresponding subtree in B should 
be propagated to both sides; that is, C(p) (the subtree 
rooted at p in C) and D(p) should be identical to B(p); 

2. Conversely, if p is not dirty in B, then we should have 
C(p) = D(p) = A(p). 

3. If p refers to a directory in both A and B, then it should
also refer to a directory in C and D. (Note that this
requirement makes sense whether or not p is dirty in A
or B.)

4. If p is dirty in both A and B and refers to something 
other than a directory (i.e., it is either a file or $\bot$)
in at least one of A and B, then we have potentially
conflicting updates. In this case, we should leave things 
as they are: C(p) = A(p) and D(p) = B(p). (Note 
that leaving things as they are is the right behavior 
even in the case where the updates were not actually 
conflicting — i.e., where it happens that A(p) = B(p).) 

A few examples should clarify the consequences of these re- 
quirements. Suppose the original state O of the filesystems 
was 

[Figure: tree diagram of the original filesystem O.]

and that we have obtained the current states A and B by
modifying the contents of d.a in A and d.b in B. Suppose,
furthermore (for the sake of simplicity), that we are using
an exact update detector, so that $dirty_A$ is true for the paths
d.a, d, and $\varepsilon$ and false otherwise, while $dirty_B$ is true for d.b,
d, and $\varepsilon$. Then, according to the requirements, the resulting
states of the two filesystems should be C and D as shown.

[Figure: tree diagrams of the filesystems A and B and of the resulting states C and D.]

The update in d.a in A has propagated to B and the update
in d.b to A, making the final states identical.

Suppose, instead, that the new filesystems A and B are 
obtained from O by adding a file in A and deleting one in 
B:

[Figure: tree diagrams of the filesystems A and B and of the resulting states C and D.]

This is an instance of the classic insert/delete ambigu- 
ity [FM82, GPJ93, PST+97] faced by any synchronization 
mechanism: if the reconciler could see only the current states 
A and B, there would be no way for it to know that c had
been added in A, as opposed to having existed on both sides
originally and having been deleted from B; symmetrically,
it could not tell whether a was deleted in B or new in A.
The dirty predicates provided by the update detector resolve 
the ambiguity: c is dirty only in A, while a is dirty only in 
B. (Note that a less accurate update detector might also 
mark c dirty in B or a dirty in A. The effect would then 
be a conflict reported by the reconciler and no changes to 
the filesystems — i.e., the specification requires that synchro- 
nization "fail safely.") 

Similarly, suppose the file d.a is renamed, in A, to d.c,
and that d.b is deleted in B. In A, the paths marked dirty
are d.a, d.c, d, and $\varepsilon$. In B, the dirty paths are d.b, d, and
$\varepsilon$. So, reconciliation will result in states C and D as shown.

[Figure: tree diagrams of the resulting states C and D.]

On the other hand, suppose that d.a is modified in A and
deleted in B, and that d.b is updated only in B. The dirty
paths in A are d.a, d, and $\varepsilon$; in B they are d.a, d.b, d, and
$\varepsilon$. The final clause above thus applies to d.a, leaving it
unmodified in C and D, while the update to d.b is propagated
to A as usual.

One small refinement is needed to complete the specification
of reconciliation. In what we've said so far, we've
considered arbitrary paths p. This is actually slightly too
permissive, leading to cases where two of the requirements
above make conflicting predictions about the results of synchronization.
Suppose, for example, that A and B are obtained
by deleting the whole directory d on one side and creating
a new file d.c within d on the other:

[Figure: tree diagrams of A, in which the directory d has been deleted, and B, in which d contains a new file c with contents h.]

The contents of D(d.c) should clearly be h after synchronization.
But what should be the contents of C(d.c)? On the one
hand, we have $dirty_A(d)$ and $dirty_B(d)$ and $\neg isdir_{A,B}(d)$, so
according to the final rule we should have $C(d) = A(d) = \bot$,
which implies $C(d.c) = \bot$. But, on the other hand, we have
$\neg dirty_A(d.c)$, so according to the first rule, we should have
C(d.c) = B(d.c) = h.

This is a case of a genuine conflicting update, and we 
believe the best value for C(d.c) here is $\bot$ (the authors of at
least one commercial synchronizer would disagree — cf. Sec- 
tion 6.1). We can resolve the ambiguity by stopping at the 
first hint of conflict — i.e., by considering only paths p where 
all the ancestors of p in both A and B refer to directories 
(and hence do not conflict): 

4.1.1 Definition: Let A and B be filesystems. A path p is
said to be relevant in (A, B) iff $\forall q < p.\ isdir_{A,B}(q)$.
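
Relevance can be checked by walking the proper prefixes of a path, as in the following Python sketch. The encoding (dicts for directories, strings for files) and the names `lookup`, `isdir`, and `relevant` are assumptions of the sketch, and the contents chosen for B are made up for the example.

```python
def lookup(fs, path):
    """Return S(p), or None if p is undefined in S."""
    node = fs
    for name in path:
        if not isinstance(node, dict) or name not in node:
            return None
        node = node[name]
    return node


def isdir(fs, path):
    return isinstance(lookup(fs, path), dict)


def relevant(a, b, path):
    """True iff every proper prefix of path is a directory in both a and b."""
    return all(isdir(a, path[:i]) and isdir(b, path[:i])
               for i in range(len(path)))


# The conflicting case discussed above: d deleted in A, d.c created in B.
A = {}
B = {"d": {"a": "f", "c": "h"}}

assert relevant(A, B, ())            # the empty path has no proper prefixes
assert relevant(A, B, ("d",))        # its only ancestor is the root directory
assert not relevant(A, B, ("d", "c"))  # d is not a directory in A
```

Since d.c is not relevant, the requirements say nothing about it directly; its fate is decided by the conflict rule applied at d.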

With this refinement, we are ready to state the formal 
specification of the reconciler. 

4.1.2 Definition [Requirements]: The pair of new
filesystems (C, D) is said to be a synchronization of a pair
of original filesystems (A, B) with respect to predicates
$dirty_A$ and $dirty_B$ if, for each relevant path p in (A, B), the
following conditions are satisfied:

\begin{align*}
\neg dirty_A(p) &\implies C(p) = D(p) = B(p) \\
\neg dirty_B(p) &\implies C(p) = D(p) = A(p) \\
isdir_{A,B}(p) &\implies isdir_{C,D}(p) \\
dirty_A(p) \wedge dirty_B(p) \wedge \neg isdir_{A,B}(p) &\implies C(p) = A(p) \wedge D(p) = B(p)
\end{align*}

4.2 Algorithm 

Having specified the reconciler precisely, we can explore 
some properties of the specification. In particular, we would 
like to know that it is complete, in the sense that it answers 
all possible questions about how a reconciler should behave, 
and that it is implementable by a concrete algorithm that
terminates on all inputs. We address the latter point first. 

For ease of comparison with the abstract requirements 
above, we present the algorithm in "purely functional" 
style — as a function taking a pair of filesystems as an ar- 
gument and returning a fresh pair of filesystems as a result. 



(Of course, a concrete realization of this algorithm would 
return no results, performing its task by side-effecting the 
two filesystems in-place. It should be obvious how to derive 
such an implementation from the description we give here.) 

In the definition, we use the following notation for overwriting
part of one filesystem with the contents of the other.
Let S and T be functions on paths and p be a path. We
write $T \triangleleft_p S$ for the function formed by replacing the subtree
rooted at p in T with the corresponding subtree of S, defined formally as follows:

\[ T \triangleleft_p S = \lambda q.\ \text{if } p \le q \text{ then } S(q) \text{ else } T(q). \]
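
On a nested encoding of filesystems, the overwriting operation might be sketched in Python as below. The name `graft`, the encoding (dicts for directories, strings for files, `None` for undefined), and the requirement that the parent of p be a directory in T are assumptions of the sketch.

```python
import copy


def lookup(fs, path):
    """Return S(p), or None if p is undefined in S."""
    node = fs
    for name in path:
        if not isinstance(node, dict) or name not in node:
            return None
        node = node[name]
    return node


def graft(t, s, path):
    """Return a copy of t whose subtree at path is replaced by s's subtree at path."""
    replacement = copy.deepcopy(lookup(s, path))
    if not path:
        return replacement
    result = copy.deepcopy(t)
    parent = lookup(result, path[:-1])
    if not isinstance(parent, dict):
        raise ValueError("the parent of path must be a directory in t")
    if replacement is None:
        parent.pop(path[-1], None)       # s is undefined at path: delete it in t
    else:
        parent[path[-1]] = replacement   # otherwise overwrite (or create) it
    return result


T = {"d": {"a": "f", "b": "g"}}
S = {"d": {"a": "f2"}}

assert graft(T, S, ("d", "a")) == {"d": {"a": "f2", "b": "g"}}
assert graft(T, S, ("d", "b")) == {"d": {"a": "f"}}   # b is undefined in S
assert T == {"d": {"a": "f", "b": "g"}}               # the input is not mutated
```

Paths outside the subtree rooted at p keep their value from T, matching the "else T(q)" branch of the definition.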

4.2.1 Definition [Reconciliation Algorithm]: Given
predicates $dirty_A$ and $dirty_B$, the algorithm recon is defined
as follows:

recon(A, B, p) =

  1) if $\neg dirty_A(p) \wedge \neg dirty_B(p)$
       then (A, B)

  2) else if $isdir_{A,B}(p)$
       then let $\{p_1, p_2, \ldots, p_n\} = children_{A,B}(p)$
                  (in lexicographic order)
            in let $(A_0, B_0) = (A, B)$
               let $(A_{i+1}, B_{i+1}) = recon(A_i, B_i, p_{i+1})$
                  for $0 \le i < n$
            in $(A_n, B_n)$

  3) else if $\neg dirty_A(p)$
       then $(A \triangleleft_p B, B)$

  4) else if $\neg dirty_B(p)$
       then $(A, B \triangleleft_p A)$

  5) else
       (A, B)

That is, recon takes a pair of filesystems A and B and a 
path p, and returns a pair of filesystems (C, D) in which the 
subtrees rooted at p have been synchronized. 
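
The five cases can be transcribed almost directly into Python. The following is a sketch under assumptions, not a reference implementation: filesystems are nested dicts with strings as file contents, `None` means undefined, and the dirty predicates come from an exact detector that compares against an assumed original state O.

```python
import copy


def lookup(fs, path):
    node = fs
    for name in path:
        if not isinstance(node, dict) or name not in node:
            return None
        node = node[name]
    return node


def isdir(fs, path):
    return isinstance(lookup(fs, path), dict)


def children(fs, path):
    node = lookup(fs, path)
    return {path + (n,) for n in node} if isinstance(node, dict) else set()


def graft(t, s, path):
    """Copy of t with the subtree at path replaced by s's subtree at path."""
    replacement = copy.deepcopy(lookup(s, path))
    if not path:
        return replacement
    result = copy.deepcopy(t)
    parent = lookup(result, path[:-1])
    if replacement is None:
        parent.pop(path[-1], None)
    else:
        parent[path[-1]] = replacement
    return result


def exact_dirty(old, new):
    return lambda path: lookup(old, path) != lookup(new, path)


def recon(a, b, path, dirty_a, dirty_b):
    if not dirty_a(path) and not dirty_b(path):      # case 1: nothing changed
        return a, b
    if isdir(a, path) and isdir(b, path):            # case 2: recurse on children
        for child in sorted(children(a, path) | children(b, path)):
            a, b = recon(a, b, child, dirty_a, dirty_b)
        return a, b
    if not dirty_a(path):                            # case 3: propagate B to A
        return graft(a, b, path), b
    if not dirty_b(path):                            # case 4: propagate A to B
        return a, graft(b, a, path)
    return a, b                                      # case 5: conflict, do nothing


O = {"d": {"a": "f", "b": "g"}}
A = {"d": {"a": "f2", "b": "g"}}    # d.a modified in A
B = {"d": {"b": "g2"}}              # d.a deleted and d.b modified in B
C, D = recon(A, B, (), exact_dirty(O, A), exact_dirty(O, B))

assert C == {"d": {"a": "f2", "b": "g2"}}   # d.b propagated, d.a left alone
assert D == {"d": {"b": "g2"}}              # the conflict at d.a changes nothing
```

The run reproduces the last example of Section 4.1: the conflicting path d.a is untouched on both sides, while the non-conflicting update to d.b reaches A.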

An easy induction on $\max(|A|, |B|) - |p|$ shows that recon
terminates for all filesystems A and B and paths p. Also, observe
that updates to the filesystems A and B are performed
only through the recursive calls and the grafting function
defined above; this ensures that recon(A, B, p) leaves unaffected
all parts of A and B that are outside the subtree
rooted at p.

4.3 Properties 

It remains, now, to verify some properties of the require- 
ments specification and the algorithm. In particular, we 
can show that (1) the requirements in Definition 4.1.2 fully 
characterize the behavior of the reconciler; and that (2) the 
reconciliation algorithm is sound with respect to the speci- 
fication, i.e., it satisfies the requirements in Definition 4.1.2. 
It is an immediate consequence of the latter fact that the 
requirements themselves are consistent, in the sense that, 
for each A, B, $dirty_A$, and $dirty_B$, there are some C and D
such that (C, D) is a synchronization of (A, B) with respect
to $dirty_A$ and $dirty_B$.

To facilitate the correctness arguments, we first intro- 
duce a refinement of the original requirements that allows 
us to focus our attention on a specific region of the two 
filesystems. 

4.3.1 Definition: The pair of new filesystems (C, D) is
said to be a synchronization after p of a pair of original
filesystems (A, B) if p is a relevant path in (A, B) and the
following conditions are satisfied for each relevant path p.q
in (A, B):

\begin{align*}
\neg dirty_A(p.q) &\implies C(p.q) = D(p.q) = B(p.q) \\
\neg dirty_B(p.q) &\implies C(p.q) = D(p.q) = A(p.q) \\
isdir_{A,B}(p.q) &\implies isdir_{C,D}(p.q) \\
dirty_A(p.q) \wedge dirty_B(p.q) \wedge \neg isdir_{A,B}(p.q) &\implies C(p.q) = A(p.q) \wedge D(p.q) = B(p.q)
\end{align*}

Note that Definition 4.1.2 is just the special case where $p = \varepsilon$.

4.3.2 Definition: Paths p and q are incomparable if neither
is a prefix of the other, i.e., $p \not\le q \wedge q \not\le p$.

4.3.3 Definition: We write $sync_p(C, D, A, B)$ if

1. (C, D) is a synchronization of (A, B) after p,

2. for all paths q, if p and q are incomparable then C(q) =
A(q) and D(q) = B(q), and

3. $q < p \wedge isdir_{A,B}(q)$ implies $isdir_{C,D}(q)$.

The requirements we have placed on the reconciler are 
complete in the sense that they uniquely capture its behav- 
ior: given two filesystems which were synchronized at some 
point in the past, there is at most one pair of new filesystems 
satisfying the requirements. 

4.3.4 Proposition [Uniqueness]: Let A, B, and O be
filesystems and suppose that $dirty_A$ and $dirty_B$ estimate the
updates from O to A and B respectively. Let p be a relevant
path in (A, B). If $(C_1, D_1)$ and $(C_2, D_2)$ are both
synchronizations of (A, B) after p, then $C_1(p) = C_2(p)$ and
$D_1(p) = D_2(p)$.

Furthermore, the requirements are satisfied by the algo- 
rithm. 

4.3.5 Proposition [Soundness]: Let A, B, and O be
filesystems and suppose that $dirty_A$ and $dirty_B$ estimate
the updates from O to A and B respectively. Then
recon(A, B, p) = (C, D) implies $sync_p(C, D, A, B)$ for any
relevant path p in (A, B).

Together, propositions 4.3.5 and 4.3.4 show that algorithm
recon is actually equivalent to the requirements
given in Definition 4.1.2. On the one hand, if (C, D) =
recon(A, B, $\varepsilon$), then by soundness we know that (C, D) is a
synchronization of A and B. On the other hand, suppose
(C, D) is a synchronization of A and B. Since the algorithm
is total, it must yield recon(A, B, $\varepsilon$) = (C', D') for some C'
and D'. But then by uniqueness, we have C = C' and
D = D'.

5 Our Implementation 

Our main goal has been to understand the synchronization 
task clearly, not to produce a full-featured synchronizer ourselves.
However, we have found it helpful (as well as useful,
for our own day to day mobile computing) to experiment
with a prototype implementation that straightforwardly embodies
the specification we have described.

Our file synchronizer is written in Java, using Java's 
Remote Method Invocation for networking. The design is 
intended to perform well over both high- and medium- 
bandwidth links (e.g., ethernet or PPP). To avoid long 
startup delays, it uses a modtime-inode strategy (cf. Sec- 
tion 3.2.4) for update detection, requiring only minimal sum- 
mary information to be stored between synchronizations. It 
operates entirely at user level, without transaction logs or 
monitor daemons. It currently handles only two replicas at 
a time and is targeted toward Unix filesystems (though all 
but the update detector could be used with any operating 
system, and new update detection modules should be fairly 
easy to write). 

The user interface (see Figure 1) displays all the files in 
which updates have occurred, using a tree-browser widget; 
selecting a file from this tree displays its status in a detail 
dialog at the right and offers a menu of reconciliation options.
In the common case where a file has been updated
in only one replica, an appropriate action is selected by de- 
fault and the tree listing shows an arrow indicating which 
direction the update will be propagated. If both replicas are 
updated, the tree view displays a question mark, indicating 
that the user must make some explicit choice. When the 
user is satisfied, a single button press fires all the selected 
actions. 

Internally, the implementation closely follows the recon- 
ciliation algorithm in Section 4.2 (see Figure 2). At the end 
of every synchronization, a summary of each replica is stored 
on the disk. The saved information includes the time when 
each file in the replica was last synchronized and its inode 
number at that time. At the beginning of the next syn- 
chronization, each update detector reads its summary and 
traverses the file system to detect updates. A file is marked 
dirty if its ctime¹ or inode number has changed since the
last synchronization. The reconciler then traverses the two 
replicas in parallel, examining the files for which updates 
have been detected on either side and posting appropriate 
records to a tree of pending actions maintained by the user 
interface. 

6 Examples 

To explore the utility of our specification, we now discuss 
some existing synchronizers in terms of the specification 
framework that we have developed. We do not attempt to 
provide a complete survey, just a few representative exam- 
ples. 

6.1 Briefcase 

Microsoft's Briefcase synchronizer [Bri98, Sch96] is part of 
Windows 95/NT. Its fundamental goals seem to match those 
embodied in our specification ("propagate updates unless 
they conflict, in which case do nothing by default") — indeed, 
even its user interface is fairly similar to our prototype. 
However, some simple experiments revealed several cases 
where Briefcase's behavior does not match what is predicted 
by our specification (or any similar specification that we can 
think of). 

¹ In Unix, a file's ctime gets changed if the contents or the attributes
(such as permission bits) of the file are changed.

Figure 1: User interface of our synchronizer



The strangest example that we encountered runs as fol- 
lows. (Since it involves two successive synchronizations, it 
should be compared with the refined requirements discussed 
in Section 7.1.) Suppose we have a synchronized filesystem
containing a directory (folder) a, a subdirectory a.b,
and a file a.b.f. Now, in one replica, we delete a and all
its contents; in the other we modify the contents of a.b.f
and add a new subdirectory a.c; then we synchronize. At
this point, Briefcase reports that no updates are needed.
(Strictly speaking, this behavior is correct, since it leaves
both replicas unchanged, but a conflict should probably have
been reported.) Now, in the second replica, we create a new
file a.b.g, and synchronize again. This time, the synchronizer
does propagate some changes: it recreates a in the first
replica, adds subdirectories a.b and a.c, and copies a.b.g — 
but not a.b.f. Success is reported, but the two filesystems 
are not identical at the end. 

6.2 PowerMerge 

According to the manufacturer's advertising [Pow98], the 
PowerMerge synchronizer from Leader Technologies is "used 
by virtually every large Macintosh organization and is the 
highest rated file synchronization program on the market 
today." We tested the "light" version of the program, which
is freely downloadable for evaluation. 

Although the description of the program's behavior in 
the user manual again seems to agree with the intentions 
embodied in our specification, we were unable to make the 
program behave as documented. For example, deleting a file 
on one side and then resynchronizing would lead to the file 
being re-created, not deleted. Also, when both copies of a
file have been modified, the most recent copy is propagated,
discarding the update in the other copy.

6.3 Rumor 

UCLA's Rumor project [Rei97, RPG+96] has built a user-
level file synchronizer for Unix filesystems, probably the
closest cousin to our own implementation. Although its 
capabilities go beyond what our specification can describe, 
Rumor (nearly) satisfies our specification in the two-replica 
case. (Rumor's model of synchronization originates from the 
Ficus replicated filesystem; much of our discussion regard- 
ing Rumor also applies to the synchronization mechanisms 
of Ficus [RPG+96, RHR+94, GPJ93].) 

In Rumor, reconciliation is performed by a local process 
in each replica, which works to ensure that the most recent 
updates to each file in other replicas are eventually reflected 
in the local state of this replica. For each file in the replica, 
Rumor maintains a version vector reflecting the known up- 
dates in all replicas. During reconciliation, this version vec- 
tor is compared with that of another replica (chosen by the 
user or determined by availability) to determine which has 
the latest updates. If the remote copy dominates, then the 
local copy is modified to reflect the updates; if the local copy 
dominates, then nothing more is done. (In essence, reconcili- 
ation in Rumor uses a "pull model": it is a one-way process.) 
If there is a conflict, Rumor invokes a resolver based on the 
type of the file; for instance, updates to Unix directories are 
handled by a "merge resolver" [RHR+94]. Updates eventually
get propagated to all replicas by repeated "gossiping"
between pairs of replicas. 

Figure 2: Internals of our synchronizer

The update detection strategy in Rumor is a variant of
the modtime-inode strategy described in Section 3.2.4. Rumor's
reconciliation process is more general than that described
by our specification. However, it does appear to
satisfy our specification if we consider the following special 
case. (1) There are exactly two Rumor replicas. (2) Both 
replicas are reconciled at the same time, each treating the 
other as the source for reconciliation. (3) Overlapping up- 
dates are handled by a simple equality check for files (by de- 
fault, Rumor considers updates to the same file in different 
replicas as a conflict, even if they result in equal contents) 
and a recursive merge resolver for directories. 

6.4 Distributed Filesystems 

Not surprisingly, our model of synchronization has some 
strong similarities to the replication models underlying 
mainstream distributed filesystems such as Coda [Kis96,
Kum94], Ficus [RHR+94, PJG+97], and Bayou [DPS+94,
TTP+95]. Related concepts also have a long history in distributed
databases (e.g., [Dav84]).

These systems differ from user-level file synchronizers,
and from each other, along numerous dimensions: continuous
reconciliation vs. discrete points of synchronization,
distinguishing or not between client and server machines, 
eager vs. lazy reconciliation, use of transaction logs vs. im- 
mediate update propagation, etc. Since explicit points of 
synchronization are not part of the user's conceptual model 
of these systems, our specification framework is not directly 
applicable. On the other hand, their underlying concepts of 
optimistic replication and reconciliation are fundamentally 
very similar to ours. The intention of synchronization — 
whenever and however it happens — is (eventually) to prop- 
agate nonconflicting updates and to detect and repair con- 
flicting updates. Our specification can therefore be viewed 
as a first step toward a more general framework in which 
such systems can be described and compared. 

One exception is the system described by Mazer and 
Tardo [MT94]. Their approach is quite similar to ours in 
that it includes explicit, user-invoked points of synchroniza- 
tion. Apart from the asymmetry in their setting between 
clients and servers, our framework could be used to model 
their system. 



6.5 Data Synchronizers 

Much of the engineering effort in commercial synchroniza- 
tion tools goes into facilities for data synchronization — 
merging updates to the same file in different replicas us- 
ing specific knowledge of the structure of the file based 
on its type (address book, calendar, etc.). Related approaches
have long been pursued in distributed database
systems [Dav84] and have resulted in products like Oracle's
Symmetric Replication [DDD+94].

Surprisingly, at least some of these tools can be described 
very directly in our framework. For example, Puma Tech- 
nology's popular Intellisync [Puma, Pumb] can synchronize 
many kinds of databases between handheld PDAs, laptop 
computers, and workstations. It requires that one or more 
key fields be chosen for each type of database to be syn- 
chronized. (For example, in an address book the key fields 
might be the first and last name; in a calendar database 
they could be the date, time, and description of an appoint- 
ment.) These key fields correspond to the name of a file 
in our model. Changing the key fields is like moving the 
file; changing information in other fields is like changing the 
contents of the file. 

To describe Intellisync in our framework, we just need to 
generalize the notion of filesystem paths to include names 
for individual records within files by allowing combinations 
of key-field values as filename components (e.g., p =
usr.bcp.phonebook.{lastname=Smith, firstname=John}).
The behavior described in the Intellisync manual then 
follows our specification quite closely. In fact, if we consider 
the operation of Intellisync just on a single database, then 
we may drop the clauses of our specification that deal with 
directories and describe its behavior even more succinctly: 

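\[
\begin{array}{l}
\neg\mathit{dirty}_A(p) \;\Longrightarrow\; C(p) = D(p) = B(p) \\
\neg\mathit{dirty}_B(p) \;\Longrightarrow\; C(p) = D(p) = A(p) \\
\mathit{dirty}_A(p) \wedge \mathit{dirty}_B(p) \;\Longrightarrow\; C(p) = A(p) \wedge D(p) = B(p)
\end{array}
\]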
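
As a rough illustration of the three clauses above, the following Python sketch reconciles two replicas of a single database. The dictionary representation, the names `reconcile` and `_put`, and the sample records are assumptions made for this example; none of them comes from Intellisync or from our implementation. Dirtiness is computed exactly against a common original state `o`, which is only one possible choice of update detector.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]   # hypothetical key fields: (lastname, firstname)
Db = Dict[Key, str]     # record contents; a missing key means "no record"


def _put(db: Db, p: Key, value: Optional[str]) -> None:
    """Set db(p) to value, where None stands for an absent record."""
    if value is None:
        db.pop(p, None)
    else:
        db[p] = value


def reconcile(o: Db, a: Db, b: Db) -> Tuple[Db, Db]:
    """Return new replicas (c, d) satisfying the three clauses for every key p."""
    c, d = dict(a), dict(b)
    for p in set(o) | set(a) | set(b):
        dirty_a = a.get(p) != o.get(p)
        dirty_b = b.get(p) != o.get(p)
        if not dirty_a:
            _put(c, p, b.get(p))   # C(p) = D(p) = B(p); d already equals b at p
        elif not dirty_b:
            _put(d, p, a.get(p))   # C(p) = D(p) = A(p); c already equals a at p
        # Both dirty: leave C(p) = A(p) and D(p) = B(p).
    return c, d


smith = ("Smith", "John")
jones = ("Jones", "Ann")
o = {smith: "555-0100", jones: "555-0111"}
a = {smith: "555-0199", jones: "555-0111"}   # Smith's number edited in A
b = {smith: "555-0100"}                      # Jones's record deleted in B
c, d = reconcile(o, a, b)
```

In this example both nonconflicting updates are propagated, so `c` and `d` end up identical. If the same record were edited differently in both replicas, each replica would keep its own version.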

6.6 Version Control Systems 

Another class of systems with some striking similarities to 
file synchronizers is version control or source control systems 
like CVS. Such systems include numerous features (version 
histories, alternative branches, etc.) that fall outside the
scope of our specification, but their core behavior includes 
commands like "check in all changes in this group of files, 
except in cases where the changes conflict with changes that 
have already been checked in by another project member." 
Our requirements might be a useful starting point for full 
specifications of such systems. 

7 Extensions 

We close by sketching some extensions of our framework. 

7.1 Partially Successful Synchronization 

If it recognizes conflicting updates, the synchronizer may 
halt without having made the filesystems identical. Then, 
the next time the synchronizer runs, there will not be one 
original filesystem, but two. In general, particular regions 
of the filesystem may have been successfully synchronized 
at different times. We can easily refine our specification 
to handle this case. (Our implementation also handles this 
refinement.) 

Instead of assuming that the replicas had some common 
state O at the end of the previous synchronization, we intro- 
duce into the specification a new filesystem T, which records 
the contents of each path p at the last time when p was suc- 
cessfully synchronized. 

The specification of the update detector remains the 
same as before, except that the dirty predicate is defined 
with respect to T. That is, $\mathit{dirty}_S(p)$ must be true whenever
p refers in S to something different from what it referred to 
at the end of the last successful synchronization of p. 

The reconciler is now extended with an additional out- 
put parameter: besides calculating the new states C and D 
of the two replicas, it returns a new filesystem T', which will 
be used as the T input to the next round of synchronization.
For each path p, $T'(p)$ records the contents of p at the
last point where p was successfully synchronized. Formally,
the triple $(C, D, T')$ is said to be a synchronization
of a pair of original filesystems $(A, B)$ with respect
to predicates $\mathit{dirty}_A$ and $\mathit{dirty}_B$ and original state T if, for
each relevant path p in $(A, B)$, the following conditions are
satisfied:

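\[
\begin{array}{l}
\neg\mathit{dirty}_A(p) \;\Longrightarrow\; C(p) = D(p) = B(p) = T'(p) \\
\neg\mathit{dirty}_B(p) \;\Longrightarrow\; C(p) = D(p) = A(p) = T'(p) \\
\mathit{isdir}_{A,B}(p) \;\Longrightarrow\; \mathit{isdir}_{C,D}(p) \wedge \mathit{isdir}_{T'}(p) \\
\mathit{dirty}_A(p) \wedge \mathit{dirty}_B(p) \wedge \neg\mathit{isdir}_{A,B}(p) \;\Longrightarrow\; C(p) = A(p) \wedge D(p) = B(p) \\
\qquad \wedge\; \mathrm{if}\ A(p) = B(p)\ \mathrm{then}\ T'(p) = A(p)\ \mathrm{else}\ T'(p) = T(p)
\end{array}
\]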
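
The role of the archive T can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions, not our implementation: filesystems are flat dictionaries from paths to contents, so the directory clause is not modeled, and the names `sync_with_archive` and `_put` are hypothetical.

```python
from typing import Dict, Optional, Tuple

Fs = Dict[str, str]   # path -> contents; a missing path means "no file"


def _put(fs: Fs, p: str, value: Optional[str]) -> None:
    """Set fs(p) to value, where None stands for an absent file."""
    if value is None:
        fs.pop(p, None)
    else:
        fs[p] = value


def sync_with_archive(t: Fs, a: Fs, b: Fs) -> Tuple[Fs, Fs, Fs]:
    """Return (c, d, t_new) for flat replicas a and b and archive t."""
    c, d, t_new = dict(a), dict(b), dict(t)
    for p in set(t) | set(a) | set(b):
        dirty_a = a.get(p) != t.get(p)   # dirtiness is relative to T
        dirty_b = b.get(p) != t.get(p)
        if not dirty_a:
            new = b.get(p)               # propagate B's (possibly unchanged) value
        elif not dirty_b:
            new = a.get(p)               # propagate A's value
        elif a.get(p) == b.get(p):
            new = a.get(p)               # both changed identically: record agreement
        else:
            continue                     # conflict: C(p)=A(p), D(p)=B(p), T'(p)=T(p)
        for fs in (c, d, t_new):
            _put(fs, p, new)
    return c, d, t_new


t = {"x": "1", "y": "1"}
a = {"x": "2", "y": "3"}   # both paths changed in A
b = {"x": "1", "y": "4"}   # only y changed in B, differently from A
c, d, t_new = sync_with_archive(t, a, b)
```

After this run, `x` is synchronized and recorded in `t_new`, while `y` stays in conflict and `t_new` keeps its old entry for `y`. Both replicas therefore remain dirty at `y` in the next round, until the user makes them agree.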

7.2 Multiple Replicas 

In general, one may wish to synchronize several replicas on 
different hosts, not just two. We can generalize our require- 
ments specification to handle multiple replicas in a fairly 
straightforward way. 

Let $\mathit{Id} = \{1, 2, \ldots, n\}$ be a set of tags identifying the n
replicas to be synchronized. Let the set of original replicas
to be synchronized be denoted by $\mathcal{F}_S = \{S_i \mid i \in \mathit{Id}\}$. For
any path p, let $D_{p,S}$ be the set of identifiers of replicas that
are dirty at p, i.e., $D_{p,S} = \{i \mid \mathit{dirty}_{S_i}(p)\}$. A set of new
replicas $\mathcal{F}_R = \{R_i \mid i \in \mathit{Id}\}$ is said to be a synchronization of
$\mathcal{F}_S$ with respect to dirtiness predicates $\mathit{dirty}_{S_i}$ if, for each
relevant path p in $\mathcal{F}_S$, the following conditions are satisfied:

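\[
\begin{array}{l}
D_{p,S} = \emptyset \;\Longrightarrow\; \forall i \in \mathit{Id}.\ R_i(p) = S_i(p) \\
D_{p,S} \neq \emptyset \wedge \forall i,j \in D_{p,S}.\ S_i(p) = S_j(p) \;\Longrightarrow\; \exists j \in D_{p,S}.\ \forall i \in \mathit{Id}.\ R_i(p) = S_j(p) \\
\mathit{isdir}_S(p) \;\Longrightarrow\; \mathit{isdir}_R(p) \\
\exists i,j \in D_{p,S}.\ S_i(p) \neq S_j(p) \wedge \neg\mathit{isdir}_S(p) \;\Longrightarrow\; \forall i \in \mathit{Id}.\ R_i(p) = S_i(p)
\end{array}
\]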
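
The multi-replica conditions can be illustrated with a short Python sketch. It assumes flat filesystems (so the directory clause is omitted) and computes dirtiness against a single common original state `o`; the specification itself takes the dirtiness predicates as given. The function name `sync_many` is hypothetical.

```python
from typing import Dict, List

Fs = Dict[str, str]   # path -> contents; a missing path means "no file"


def sync_many(o: Fs, replicas: List[Fs]) -> List[Fs]:
    """Synchronize n flat replicas against a common original state o."""
    result = [dict(s) for s in replicas]
    paths = set(o).union(*replicas)
    for p in paths:
        # D_{p,S}: indices of the replicas that are dirty at p.
        dirty = [i for i, s in enumerate(replicas) if s.get(p) != o.get(p)]
        if not dirty:
            continue                 # nothing changed: R_i(p) = S_i(p)
        values = {replicas[i].get(p) for i in dirty}
        if len(values) > 1:
            continue                 # dirty replicas disagree: R_i(p) = S_i(p)
        (new,) = values              # all dirty replicas agree on this value
        for r in result:
            if new is None:
                r.pop(p, None)       # the agreed update is a deletion
            else:
                r[p] = new
    return result


o = {"f": "v0", "g": "v0"}
s1 = {"f": "v1", "g": "v0"}
s2 = {"f": "v0", "g": "v2"}
s3 = {"f": "v1", "g": "v3"}
result = sync_many(o, [s1, s2, s3])
```

Here `f` was changed identically in two replicas, so the new value reaches all three. `g` was changed differently in two replicas, so every replica keeps its own version of `g`.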

It is interesting to note that Coda's reconciliation strat- 
egy depends on a similar requirement. Coda has a certifi- 
cation mechanism which ensures that reconciliation is safe 
to proceed. Kumar [Kum94, pages 58-61] proves that, if 
certification succeeds at all servers, then for each data item 
d, either (i) d is not modified in any partition, (ii) the fi- 
nal value of d in each partition is equal to the pre-partition 
value, or (iii) d was modified in exactly one partition. 

In a multi-replica system, the process of reconciliation 
may in general only involve a subset of the replicas at one 
time. To describe the intended behavior in this case, we 
would need to combine the above specification with the re- 
finement described in Section 7.1. 

7.3 Additional Filesystem Properties 

A related generalization offers a natural means of extend- 
ing our simple model of the filesystem to include properties 
like read/write/execute permissions, timestamps, type in- 
formation, symbolic links, etc. For example, a symbolic link 
can be regarded as a special kind of file whose contents is 
the target of the link. Similarly, to handle permission bits 
for files, we take the contents of the file to include both its 
proper contents and the permission bits. 
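
One way to picture this encoding is the following Python sketch, in which the "contents" of a path is a record bundling the data with its properties. The `Entry` type, its field names, and `is_dirty` are assumptions introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Entry:
    data: str                  # proper contents, or the target of a symbolic link
    mode: int                  # permission bits, e.g. 0o644
    is_symlink: bool = False


def is_dirty(original: Dict[str, Entry], current: Dict[str, Entry], path: str) -> bool:
    """A path is dirty if any component of its entry differs from the original."""
    return current.get(path) != original.get(path)


original = {"run.sh": Entry("echo hi", 0o644)}
current = {"run.sh": Entry("echo hi", 0o755)}   # only the permissions changed
changed = is_dirty(original, current, "run.sh")
```

Because equality on `Entry` compares every field, a change to the permission bits alone makes the path dirty, and it is then propagated like any other update to the contents.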

Hard links are somewhat more difficult to handle, espe- 
cially if it is possible to create a hard link from inside a 
synchronized filesystem to some unsynchronized file. How- 
ever, if this case is excluded, it seems reasonable to handle 
hard links by annotating each filesystem with a relation de- 
scribing which files are hard-linked together and taking this 
additional information into account in the update detector 
and reconciler. 